An Introduction to 
MATHEMATICAL STATISTICS 
and Its Applications 




RICHARD J. LARSEN 


MORRIS L. MARX 


AN INTRODUCTION TO 
MATHEMATICAL STATISTICS 
AND ITS APPLICATIONS 


Fifth Edition 


RICHARD J. LARSEN 
Vanderbilt University 


MORRIS L. MARX
University of West Florida 


Prentice Hall 
Boston Columbus Indianapolis New York San Francisco 
Upper Saddle River Amsterdam Cape Town Dubai 
London Madrid Milan Munich Paris Montréal
Toronto Delhi Mexico City São Paulo Sydney
Hong Kong Seoul Singapore Taipei Tokyo 


Editor in Chief: Deirdre Lynch 

Acquisitions Editor: Christopher Cummings 

Associate Editor: Christina Lepre 

Assistant Editor: Dana Jones 

Senior Managing Editor: Karen Wernholm 

Associate Managing Editor: Tamela Ambush 

Senior Production Project Manager: Peggy McMahon 
Senior Design Supervisor: Andrea Nix 

Cover Design: Beth Paquin 

Interior Design: Tamara Newnam 

Marketing Manager: Alex Gay 

Marketing Assistant: Kathleen DeChavez 

Senior Author Support/Technology Specialist: Joe Vetere 
Manufacturing Manager: Evelyn Beaton 

Senior Manufacturing Buyer: Carol Melville 

Production Coordination, Technical Illustrations, and Composition: Integra Software Services, Inc. 
Cover Photo: © Jason Reed/Getty Images 


Many of the designations used by manufacturers and sellers to distinguish their products are claimed as 
trademarks. Where those designations appear in this book, and Pearson was aware of a trademark 
claim, the designations have been printed in initial caps or all caps. 


Library of Congress Cataloging-in-Publication Data 


Larsen, Richard J. 
An introduction to mathematical statistics and its applications / 
Richard J. Larsen, Morris L. Marx.—5th ed.
p. cm.
Includes bibliographical references and index. 
ISBN 978-0-321-69394-5 
1. Mathematical statistics—Textbooks. I. Marx, Morris L. II. Title. 
QA276.L314 2012 
519.5—dc22 
2010001387 


Copyright © 2012, 2006, 2001, 1986, and 1981 by Pearson Education, Inc. All rights reserved. No part of 
this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any 
means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written 
permission of the publisher. Printed in the United States of America. For information on obtaining 
permission for use of material in this work, please submit a written request to Pearson Education, Inc., 
Rights and Contracts Department, 501 Boylston Street, Suite 900, Boston, MA 02116, fax your request 
to 617-671-3447, or e-mail at http://www.pearsoned.com/legal/permissions.htm. 


1 2 3 4 5 6 7 8 9 10—EB—14 13 12 11 10


Prentice Hall is an imprint of PEARSON


ISBN-13: 978-0-321-69394-5
ISBN-10: 0-321-69394-9




www.pearsonhighered.com 


TABLE OF CONTENTS 


Preface viii 


1 INTRODUCTION 1


1.1 An Overview 1
1.2 Some Examples 2
1.3 A Brief History 7
1.4 A Chapter Summary 14


2 PROBABILITY 16


2.1 Introduction 16
2.2 Sample Spaces and the Algebra of Sets 18
2.3 The Probability Function 27
2.4 Conditional Probability 32
2.5 Independence 53
2.6 Combinatorics 67
2.7 Combinatorial Probability 90
2.8 Taking a Second Look at Statistics (Monte Carlo Techniques) 99


3 RANDOM VARIABLES 102


3.1 Introduction 102
3.2 Binomial and Hypergeometric Probabilities 103
3.3 Discrete Random Variables 118
3.4 Continuous Random Variables 129
3.5 Expected Values 139
3.6 The Variance 155
3.7 Joint Densities 162
3.8 Transforming and Combining Random Variables 176
3.9 Further Properties of the Mean and Variance 183
3.10 Order Statistics 193
3.11 Conditional Densities 200
3.12 Moment-Generating Functions 207
3.13 Taking a Second Look at Statistics (Interpreting Means) 216
Appendix 3.A.1 Minitab Applications 218




4 SPECIAL DISTRIBUTIONS 221


4.1 Introduction 221
4.2 The Poisson Distribution 222
4.3 The Normal Distribution 239
4.4 The Geometric Distribution 260
4.5 The Negative Binomial Distribution 262
4.6 The Gamma Distribution 270
4.7 Taking a Second Look at Statistics (Monte Carlo Simulations) 274
Appendix 4.A.1 Minitab Applications 278
Appendix 4.A.2 A Proof of the Central Limit Theorem 280


5 ESTIMATION 281


5.1 Introduction 281
5.2 Estimating Parameters: The Method of Maximum Likelihood and
the Method of Moments 284
5.3 Interval Estimation 297
5.4 Properties of Estimators 312
5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound 320
5.6 Sufficient Estimators 323
5.7 Consistency 330
5.8 Bayesian Estimation 333
5.9 Taking a Second Look at Statistics (Beyond Classical Estimation) 345
Appendix 5.A.1 Minitab Applications 346


6 HYPOTHESIS TESTING 350


6.1 Introduction 350
6.2 The Decision Rule 351
6.3 Testing Binomial Data—$H_0$: $p = p_0$ 361
6.4 Type I and Type II Errors 366
6.5 A Notion of Optimality: The Generalized Likelihood Ratio 379
6.6 Taking a Second Look at Statistics (Statistical Significance versus
“Practical” Significance) 382




7 INFERENCES BASED ON THE NORMAL DISTRIBUTION 385


7.1 Introduction 385
7.2 Comparing $\frac{\bar{Y}-\mu}{\sigma/\sqrt{n}}$ and $\frac{\bar{Y}-\mu}{S/\sqrt{n}}$ 386
7.3 Deriving the Distribution of $\frac{\bar{Y}-\mu}{S/\sqrt{n}}$ 388
7.4 Drawing Inferences About $\mu$ 394
7.5 Drawing Inferences About $\sigma^2$ 410
7.6 Taking a Second Look at Statistics (Type II Error) 418
Appendix 7.A.1 Minitab Applications 421
Appendix 7.A.2 Some Distribution Results for $\bar{Y}$ and $S^2$ 423
Appendix 7.A.3 A Proof that the One-Sample t Test is a GLRT 425
Appendix 7.A.4 A Proof of Theorem 7.5.2 427


8 TYPES OF DATA: A BRIEF OVERVIEW 430


8.1 Introduction 430
8.2 Classifying Data 435
8.3 Taking a Second Look at Statistics (Samples Are Not “Valid”!) 455


9 TWO-SAMPLE INFERENCES 457


9.1 Introduction 457
9.2 Testing $H_0$: $\mu_X = \mu_Y$ 458
9.3 Testing $H_0$: $\sigma_X^2 = \sigma_Y^2$—The F Test 471
9.4 Binomial Data: Testing $H_0$: $p_X = p_Y$ 476
9.5 Confidence Intervals for the Two-Sample Problem 481
9.6 Taking a Second Look at Statistics (Choosing Samples) 487
Appendix 9.A.1 A Derivation of the Two-Sample t Test (A Proof of
Theorem 9.2.2) 488
Appendix 9.A.2 Minitab Applications 491


10 GOODNESS-OF-FIT TESTS 493


10.1 Introduction 493
10.2 The Multinomial Distribution 494
10.3 Goodness-of-Fit Tests: All Parameters Known 499
10.4 Goodness-of-Fit Tests: Parameters Unknown 509
10.5 Contingency Tables 519


10.6 Taking a Second Look at Statistics (Outliers) 529 
Appendix 10.A.1 Minitab Applications 531 


11 REGRESSION 532


11.1 Introduction 532 

11.2 The Method of Least Squares 533 
11.3 The Linear Model 555 

11.4 Covariance and Correlation 575 

11.5 The Bivariate Normal Distribution 582 


11.6 Taking a Second Look at Statistics (How Not to Interpret 
the Sample Correlation Coefficient) 589 


Appendix 11.A.1 Minitab Applications 590 
Appendix 11.A.2 A Proof of Theorem 11.3.3 592 


12 THE ANALYSIS OF VARIANCE 595 


12.1 Introduction 595 

12.2 The F Test 597

12.3 Multiple Comparisons: Tukey’s Method 608 
12.4 Testing Subhypotheses with Contrasts 611 
12.5 Data Transformations 617 


12.6 Taking a Second Look at Statistics (Putting the Subject of 
Statistics Together—The Contributions of Ronald A. Fisher) 619 


Appendix 12.A.1 Minitab Applications 621 
Appendix 12.A.2 A Proof of Theorem 12.2.2 624 


Appendix 12.A.3 The Distribution of $\frac{SSTR/(k-1)}{SSE/(n-k)}$ When $H_1$ is True 624


13 RANDOMIZED BLOCK DESIGNS 629 


13.1 Introduction 629 
13.2 The F Test for a Randomized Block Design 630 
13.3 The Paired t Test 642


13.4 Taking a Second Look at Statistics (Choosing between a 
Two-Sample t Test and a Paired t Test) 649 


Appendix 13.A.1 Minitab Applications 653 


14 NONPARAMETRIC STATISTICS 655


14.1 Introduction 656 
14.2 The Sign Test 657 




14.3 Wilcoxon Tests 662 

14.4 The Kruskal-Wallis Test 677 

14.5 The Friedman Test 682 

14.6 Testing for Randomness 684

14.7 Taking a Second Look at Statistics (Comparing Parametric 


and Nonparametric Procedures) 689 
Appendix 14.A.1 Minitab Applications 693 
Appendix: Statistical Tables 696 
Answers to Selected Odd-Numbered Questions 723 


Bibliography 745


Index 753 




PREFACE 




The first edition of this text was published in 1981. Each subsequent revision since 
then has undergone more than a few changes. Topics have been added, com- 
puter software and simulations introduced, and examples redone. What has not 
changed over the years is our pedagogical focus. As the title indicates, this book 
is an introduction to mathematical statistics and its applications. Those last three 
words are not an afterthought. We continue to believe that mathematical statistics 
is best learned and most effectively motivated when presented against a back- 
drop of real-world examples and all the issues that those examples necessarily 
raise. 

We recognize that college students today have more mathematics courses to 
choose from than ever before because of the new specialties and interdisciplinary 
areas that continue to emerge. For students wanting a broad educational experi- 
ence, an introduction to a given topic may be all that their schedules can reasonably 
accommodate. Our response to that reality has been to ensure that each edition of 
this text provides a more comprehensive and more usable treatment of statistics 
than did its predecessors. 

Traditionally, the focus of mathematical statistics has been fairly narrow—the 
subject’s objective has been to provide the theoretical foundation for all of the var- 
ious procedures that are used for describing and analyzing data. What it has not 
spoken to at much length are the important questions of which procedure to use 
in a given situation, and why. But those are precisely the concerns that every user 
of statistics must inevitably confront. To that end, adding features that can create 
a path from the theory of statistics to its practice has become an increasingly high 
priority. 


New to This Edition 


• Beginning with the third edition, Chapter 8, titled “Data Models,” was added.
It discussed some of the basic principles of experimental design, as well as some
guidelines for knowing how to begin a statistical analysis. In this fifth edition, the
Data Models (“Types of Data: A Brief Overview”) chapter has been substantially
rewritten to make its main points more accessible.

• Beginning with the fourth edition, the end of each chapter except the first featured
a section titled “Taking a Second Look at Statistics.” Many of these sections
describe the ways that statistical terminology is often misinterpreted in what we
see, hear, and read in our modern media. Continuing in this vein of interpretation,
we have added in this fifth edition comments called “About the Data.”
These sections are scattered throughout the text and are intended to encourage
the reader to think critically about a data set’s assumptions, interpretations, and
implications.

• Many examples and case studies have been updated, while some have been
deleted and others added.

• Section 3.8, “Transforming and Combining Random Variables,” has been
rewritten.

• Section 3.9, “Further Properties of the Mean and Variance,” now includes a discussion
of covariances so that sums of random variables can be dealt with in more
generality.

• Chapter 5, “Estimation,” now has an introduction to bootstrapping.

• Chapter 7, “Inferences Based on the Normal Distribution,” has new material on
the noncentral t distribution and its role in calculating Type II error probabilities.

• Chapter 9, “Two-Sample Inferences,” has a derivation of Welch’s approximation
for testing the differences of two means in the case of unequal
variances.


We hope that the changes in this edition will not undo the best features of the 
first four. What made the task of creating the fifth edition an enjoyable experience 
was the nature of the subject itself and the way that it can be beautifully elegant and 
down-to-earth practical, all at the same time. Ultimately, our goal is to share with 
the reader at least some small measure of the affection we feel for mathematical 
statistics and its applications. 


Supplements 


Instructor’s Solutions Manual. This resource contains worked-out solutions to 
all text exercises and is available for download from the Pearson Education 
Instructor Resource Center. 

Student Solutions Manual ISBN-10: 0-321-69402-3; ISBN-13: 978-0-321- 
69402-7. Featuring complete solutions to selected exercises, this is a great tool 
for students as they study and work through the problem material. 


Chapter 1


INTRODUCTION 


1.1 An Overview
1.2 Some Examples
1.3 A Brief History
1.4 A Chapter Summary


“Until the phenomena of any branch of knowledge have been submitted to 
measurement and number it cannot assume the status and dignity of a science.” 
—Francis Galton 


1.1 An Overview


Sir Francis Galton was a preeminent biologist of the nineteenth century. A passionate
advocate for the theory of evolution (he was a half cousin of Charles Darwin),
Galton was also an early crusader for the study of statistics and believed the subject
would play a key role in the advancement of science:


Some people hate the very name of statistics, but I find them full of beauty and inter- 
est. Whenever they are not brutalized, but delicately handled by the higher methods, 
and are warily interpreted, their power of dealing with complicated phenomena is 
extraordinary. They are the only tools by which an opening can be cut through the 
formidable thicket of difficulties that bars the path of those who pursue the Science 
of man. 


Did Galton’s prediction come to pass? Absolutely —try reading a biology journal 
or the analysis of a psychology experiment before taking your first statistics course. 
Science and statistics have become inseparable, two peas in the same pod. What the 
good gentleman from London failed to anticipate, though, is the extent to which all 
of us—not just scientists—have become enamored (some would say obsessed) with 
numerical information. The stock market is awash in averages, indicators, trends, 
and exchange rates; federal education initiatives have taken standardized testing to 
new levels of specificity; Hollywood uses sophisticated demographics to see who’s 
watching what, and why; and pollsters regularly tally and track our every opinion, 
regardless of how irrelevant or uninformed. In short, we have come to expect every- 
thing to be measured, evaluated, compared, scaled, ranked, and rated—and if the 
results are deemed unacceptable for whatever reason, we demand that someone or 
something be held accountable (in some appropriately quantifiable way). 

To be sure, many of these efforts are carefully carried out and make perfectly 
good sense; unfortunately, others are seriously flawed, and some are just plain 
nonsense. What they all speak to, though, is the clear and compelling need to know 
something about the subject of statistics, its uses and its misuses. 




This book addresses two broad topics—the mathematics of statistics and the 
practice of statistics. The two are quite different. The former refers to the probabil- 
ity theory that supports and justifies the various methods used to analyze data. For 
the most part, this background material is covered in Chapters 2 through 7. The key 
result is the central limit theorem, which is one of the most elegant and far-reaching 
results in all of mathematics. (Galton believed the ancient Greeks would have per- 
sonified and deified the central limit theorem had they known of its existence.) Also 
included in these chapters is a thorough introduction to combinatorics, the math- 
ematics of systematic counting. Historically, this was the very topic that launched 
the development of probability in the first place, back in the seventeenth century. 
In addition to its connection to a variety of statistical procedures, combinatorics is 
also the basis for every state lottery and every game of chance played with a roulette 
wheel, a pair of dice, or a deck of cards. 

The practice of statistics refers to all the issues (and there are many!) that arise 
in the design, analysis, and interpretation of data. Discussions of these topics appear 
in several different formats. Following most of the case studies throughout the text is 
a feature entitled “About the Data.” These are additional comments about either the 
particular data in the case study or some related topic suggested by those data. Then 
near the end of most chapters is a Taking a Second Look at Statistics section. Several 
of these deal with the misuses of statistics—specifically, inferences drawn incorrectly 
and terminology used inappropriately. The most comprehensive data-related discus- 
sion comes in Chapter 8, which is devoted entirely to the critical problem of knowing 
how to start a statistical analysis—that is, knowing which procedure should be used, 
and why. 

More than a century ago, Galton described what he thought a knowledge of 
statistics should entail. Understanding “the higher methods,” he said, was the key 
to ensuring that data would be “delicately handled” and “warily interpreted.” The 
goal of this book is to make that happen. 


1.2 Some Examples 


Statistical methods are often grouped into two broad categories — descriptive statis- 
tics and inferential statistics. The former refers to all the various techniques for 
summarizing and displaying data. These are the familiar bar graphs, pie charts, scat- 
terplots, means, medians, and the like, that we see so often in the print media. The 
much more mathematical inferential statistics are procedures that make generaliza- 
tions and draw conclusions of various kinds based on the information contained in 
a set of data; moreover, they calculate the probability of the generalizations being 
correct. 

Described in this section are three case studies. The first illustrates a very effec- 
tive use of several descriptive techniques. The latter two illustrate the sorts of 
questions that inferential procedures can help answer. 


Case Study 1.2.1 


Pictured at the top of Figure 1.2.1 is the kind of information routinely recorded 
by a seismograph — listed chronologically are the occurrence times and Richter 
magnitudes for a series of earthquakes. As raw data, the numbers are largely
meaningless: No patterns are evident, nor is there any obvious connection
between the frequencies of tremors and their severities. 


Episode number   Date   Time         Severity (Richter scale)
217              6/19   4:53 P.M.    2.9
218              7/2    6:07 A.M.    3.1
219              7/4    8:19 A.M.    2.0
220              8/7    1:10 A.M.    4.1
221              8/7    10:46 P.M.   3.6
[Plot: average number of earthquakes per year, N, versus magnitude on Richter scale, R]


Figure 1.2.1


Shown at the bottom of the figure is the result of applying several descrip- 
tive techniques to an actual set of seismograph data recorded over a period of 
several years in southern California (67). Plotted above the Richter (R) value of 
4.0, for example, is the average number (N) of earthquakes occurring per year 
in that region having magnitudes in the range 3.75 to 4.25. Similar points are 
included for R-values centered at 4.5, 5.0, 5.5, 6.0, 6.5, and 7.0. Now we can see 
that earthquake frequencies and severities are clearly related: Describing the 
(N, R)’s exceptionally well is the equation 


$N = 80{,}338.16\,e^{-1.981R}$ (1.2.1)


which is found using a procedure described in Chapter 11. (Note: Geologists have
shown that the model $N = \beta_0 e^{\beta_1 R}$ describes the (N, R) relationship all over the
world. All that changes from region to region are the numerical values for $\beta_0$
and $\beta_1$.)




Notice that Equation 1.2.1 is more than just an elegant summary of the 
observed (N, R) relationship. Rather, it allows us to estimate the likelihood 
of future earthquake catastrophes for large values of R that have never been 
recorded. For example, many Californians worry about the “Big One,” a mon- 
ster tremor—say, R = 10.0—that breaks off chunks of tourist-covered beaches 
and sends them floating toward Hawaii. How often might we expect that to 
happen? Setting R = 10.0 in Equation 1.2.1 gives 


$N = 80{,}338.16\,e^{-1.981(10.0)} = 0.0002$ earthquake per year


which translates to a prediction of one such megaquake every five thousand 
years (= 1/0.0002). (Of course, whether that estimate is alarming or reassuring 
probably depends on whether you live in San Diego or Topeka... .) 
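
The arithmetic behind that prediction can be checked with a few lines of Python. This is only a sketch that evaluates Equation 1.2.1 with the coefficients quoted in the text; the names `BETA_0`, `BETA_1`, and `quakes_per_year` are illustrative and do not come from the original study.

```python
import math

# Coefficients as quoted in Equation 1.2.1 (southern California data).
# The constant and function names are illustrative assumptions.
BETA_0 = 80338.16
BETA_1 = -1.981


def quakes_per_year(magnitude: float) -> float:
    """Average number of earthquakes per year predicted at a Richter magnitude."""
    return BETA_0 * math.exp(BETA_1 * magnitude)


rate = quakes_per_year(10.0)
print(f"{rate:.4f} earthquake per year")
print(f"roughly one every {1 / rate:,.0f} years")
```

The text's figure of five thousand years comes from inverting the rounded rate, 1/0.0002. Inverting the unrounded rate gives a value slightly below five thousand, which is the same prediction for practical purposes.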


About the Data The megaquake prediction prompted by Equation 1.2.1 raises an 
obvious question: Why is the calculation that led to the model $N = 80{,}338.16\,e^{-1.981R}$
not considered an example of inferential statistics even though it did yield a pre- 
diction for R = 10? The answer is that Equation 1.2.1—by itself—does not tell us 
anything about the “error” associated with its predictions. In Chapter 11, a more 
elaborate probability method based on Equation 1.2.1 is described that does yield 
error estimates and qualifies as a bona fide inference procedure. 


Case Study 1.2.2 


Claims of disputed authorship can be very difficult to resolve. Speculation has 
persisted for several hundred years that some of William Shakespeare’s works 
were written by Sir Francis Bacon (or maybe Christopher Marlowe). And 
whether it was Alexander Hamilton or James Madison who wrote certain of 
the Federalist Papers is still an open question. Less well known is a controversy 
surrounding Mark Twain and the Civil War. 

One of the most revered of all American writers, Twain was born in 1835, 
which means he was twenty-six years old when hostilities between the North 
and South broke out. At issue is whether he was ever a participant in the war— 
and, if he was, on which side. Twain always dodged the question and took the 
answer to his grave. Even had he made a full disclosure of his military record, 
though, his role in the Civil War would probably still be a mystery because of 
his self-proclaimed predisposition to be less than truthful. Reflecting on his life, 
Twain made a confession that would give any would-be biographer pause: “I am 
an old man,” he said, “and have known a great many troubles, but most of them 
never happened.” 

What some historians think might be the clue that solves the mystery is a set 
of ten essays that appeared in 1861 in the New Orleans Daily Crescent. Signed 
“Quintus Curtius Snodgrass,” the essays purported to chronicle the author’s
adventures as a member of the Louisiana militia. Many experts believe that the 
exploits described actually did happen, but Louisiana field commanders had 
no record of anyone named Quintus Curtius Snodgrass. More significantly, the 
pieces display the irony and humor for which Twain was so famous. 

Table 1.2.1 summarizes data collected in an attempt (16) to use statistical 
inference to resolve the debate over the authorship of the Snodgrass letters. 
Listed are the proportions of three-letter words (1) in eight essays known to 
have been written by Mark Twain and (2) in the ten Snodgrass letters. 

Researchers have found that authors tend to have characteristic word- 
length profiles, regardless of what the topic might be. It follows, then, that if 
Twain and Snodgrass were the same person, the proportion of, say, three-letter 
words that they used should be roughly the same. The bottom of Table 1.2.1 
shows that, on the average, 23.2% of the words in a Twain essay were three 
letters long; the corresponding average for the Snodgrass letters was 21.0%. 

If Twain and Snodgrass were the same person, the difference between these 
average three-letter proportions should be close to 0: for these two sets of 
essays, the difference in the averages was 0.022 (= 0.232 − 0.210). How should
we interpret the difference 0.022 in this context? Two explanations need to be 
considered: 


1. The difference, 0.022, is sufficiently small (i.e., close to 0) that it does not 
rule out the possibility that Twain and Snodgrass were the same person. 

or 

2. The difference, 0.022, is so large that the only reasonable conclusion is that 
Twain and Snodgrass were not the same person. 


Choosing between explanations 1 and 2 is an example of hypothesis testing, 
which is a very frequently encountered form of statistical inference. 

The principles of hypothesis testing are introduced in Chapter 6, and the 
particular procedure that applies to Table 1.2.1 first appears in Chapter 9. 
So as not to spoil the ending of a good mystery, we will defer unmasking 
Mr. Snodgrass until then. 


Table 1.2.1

Twain                                 Proportion    QCS            Proportion

Sergeant Fathom letter                0.225         Letter I       0.209
Madame Caprell letter                 0.262         Letter II      0.205
Mark Twain letters in                               Letter III     0.196
  Territorial Enterprise                            Letter IV      0.210
    First letter                      0.217         Letter V       0.202
    Second letter                     0.240         Letter VI      0.207
    Third letter                      0.230         Letter VII     0.224
    Fourth letter                     0.229         Letter VIII    0.223
First Innocents Abroad letter                       Letter IX      0.220
    First half                        0.235         Letter X       0.201
    Second half                       0.217

Average:                              0.232                        0.210
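
The averages in the last row of Table 1.2.1, and the difference of 0.022 discussed above, can be reproduced from the listed proportions. The sketch below is only an arithmetic check, not the hypothesis test itself (that procedure is deferred to Chapter 9); the variable names are illustrative.

```python
from statistics import mean

# Proportions of three-letter words, copied from Table 1.2.1.
twain = [0.225, 0.262, 0.217, 0.240, 0.230, 0.229, 0.235, 0.217]
snodgrass = [0.209, 0.205, 0.196, 0.210, 0.202, 0.207, 0.224, 0.223, 0.220, 0.201]

twain_avg = mean(twain)
snodgrass_avg = mean(snodgrass)
difference = twain_avg - snodgrass_avg

print(f"Twain average:     {twain_avg:.3f}")
print(f"Snodgrass average: {snodgrass_avg:.3f}")
print(f"Difference:        {difference:.3f}")
```

Whether a gap of this size is "small" or "large" depends on how much the proportions vary from essay to essay within each group, which is exactly what the inferential procedure has to account for.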
Case Study 1.2.3


It may not be made into a movie anytime soon, but the way that statistical infer- 
ence was used to spy on the Nazis in World War II is a pretty good tale. And it 
certainly did have a surprise ending! 

The story began in the early 1940s. Fighting in the European theatre was 
intensifying, and Allied commanders were amassing a sizeable collection of 
abandoned and surrendered German weapons. When they inspected those 
weapons, the Allies noticed that each one bore a different number. Aware of 
the Nazis’ reputation for detailed record keeping, the Allies surmised that each 
number represented the chronological order in which the piece had been man- 
ufactured. But if that was true, might it be possible to use the “captured” serial 
numbers to estimate the total number of weapons the Germans had produced? 

That was precisely the question posed to a group of government statisticians 
working out of Washington, D.C. Wanting to estimate an adversary’s manufac- 
turing capability was, of course, nothing new. Up to that point, though, the only 
sources of that information had been spies and traitors; using serial numbers 
was something entirely new. 

The answer turned out to be a fairly straightforward application of the prin- 
ciples that will be introduced in Chapter 5. If n is the total number of captured 
serial numbers and $x_{\max}$ is the largest captured serial number, then the estimate
for the total number of items produced is given by the formula


estimated output = $[(n + 1)/n]\,x_{\max} - 1$ (1.2.2)


Suppose, for example, that n = 5 tanks were captured and they bore the serial
numbers 92, 14, 28, 300, and 146, respectively. Then $x_{\max} = 300$ and the estimated
total number of tanks manufactured is 359:


estimated output = $[(5 + 1)/5] \cdot 300 - 1$
= 359
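
Equation 1.2.2 is simple enough to express as a short function. The following is a minimal sketch applied to the five hypothetical serial numbers in the example; the function name `estimated_output` is an illustrative choice, not something taken from the wartime analysis.

```python
def estimated_output(serial_numbers):
    """Estimate total production from captured serial numbers (Equation 1.2.2)."""
    n = len(serial_numbers)
    if n == 0:
        raise ValueError("at least one serial number is required")
    x_max = max(serial_numbers)
    # [(n + 1) / n] * x_max - 1, multiplying before dividing to limit rounding error
    return (n + 1) * x_max / n - 1


captured = [92, 14, 28, 300, 146]
print(estimated_output(captured))
```

Only the sample size and the largest serial number enter the formula, so the other four values in `captured` affect the estimate solely through n.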


Did Equation 1.2.2 work? Better than anyone could have expected (proba- 
bly even the statisticians). When the war ended and the Third Reich’s “true” 
production figures were revealed, it was found that serial number estimates 
were far more accurate in every instance than all the information gleaned 
from traditional espionage operations, spies, and informants. The serial num- 
ber estimate for German tank production in 1942, for example, was 3400, a 
figure very close to the actual output. The “official” estimate, on the other 
hand, based on intelligence gathered in the usual ways, was a grossly inflated 
18,000 (64). 


About the Data Large discrepancies, like 3400 versus 18,000 for the tank estimates, 
were not uncommon. The espionage-based estimates were consistently erring on the 
high side because of the sophisticated Nazi propaganda machine that deliberately 
exaggerated the country’s industrial prowess. On spies and would-be adversaries, 
the Third Reich’s carefully orchestrated dissembling worked exactly as planned; on 
Equation 1.2.2, though, it had no effect whatsoever! 


Figure 1.3.1 




1.3 A Brief History 


For those interested in how we managed to get to where we are (or who just want 
to procrastinate a bit longer), Section 1.3 offers a brief history of probability and 
statistics. The two subjects were not mathematical littermates—they began at dif- 
ferent times in different places for different reasons. How and why they eventually 
came together makes for an interesting story and reacquaints us with some towering 
figures from the past. 


Probability: The Early Years 


No one knows where or when the notion of chance first arose; it fades into our 
prehistory. Nevertheless, evidence linking early humans with devices for generating 
random events is plentiful: Archaeological digs, for example, throughout the ancient 
world consistently turn up a curious overabundance of astragali, the heel bones of 
sheep and other vertebrates. Why should the frequencies of these bones be so dis- 
proportionately high? One could hypothesize that our forebears were fanatical foot 
fetishists, but two other explanations seem more plausible: The bones were used for 
religious ceremonies and for gambling. 

Astragali have six sides but are not symmetrical (see Figure 1.3.1). Those found 
in excavations typically have their sides numbered or engraved. For many ancient 
civilizations, astragali were the primary mechanism through which oracles solicited 
the opinions of their gods. In Asia Minor, for example, it was customary in divination 
rites to roll, or cast, five astragali. Each possible configuration was associated with 
the name of a god and carried with it the sought-after advice. An outcome of (1, 3, 
3, 4, 4), for instance, was said to be the throw of the savior Zeus, and its appearance 
was taken as a sign of encouragement (34): 


One one, two threes, two fours 
The deed which thou meditatest, go do it boldly. 
Put thy hand to it. The gods have given thee 
favorable omens 
Shrink not from it in thy mind, for no evil 
shall befall thee. 


Sheep astragalus

A (4, 4, 4, 6, 6), on the other hand, the throw of the child-eating Cronos, would send
everyone scurrying for cover: 


Three fours and two sixes. God speaks as follows. 
Abide in thy house, nor go elsewhere, 
Lest a ravening and destroying beast come nigh thee.
For I see not that this business is safe. But bide 
thy time. 


Gradually, over thousands of years, astragali were replaced by dice, and the 
latter became the most common means for generating random events. Pottery dice 
have been found in Egyptian tombs built before 2000 B.C.; by the time the Greek
civilization was in full flower, dice were everywhere. (Loaded dice have also been 
found. Mastering the mathematics of probability would prove to be a formidable 
task for our ancestors, but they quickly learned how to cheat!) 

The lack of historical records blurs the distinction initially drawn between div- 
ination ceremonies and recreational gaming. Among more recent societies, though, 
gambling emerged as a distinct entity, and its popularity was irrefutable. The Greeks 
and Romans were consummate gamblers, as were the early Christians (91). 

Rules for many of the Greek and Roman games have been lost, but we can 
recognize the lineage of certain modern diversions in what was played during the 
Middle Ages. The most popular dice game of that period was called hazard, the 
name deriving from the Arabic al zhar, which means “a die.” Hazard is thought 
to have been brought to Europe by soldiers returning from the Crusades; its rules 
are much like those of our modern-day craps. Cards were first introduced in the 
fourteenth century and immediately gave rise to a game known as Primero, an early 
form of poker. Board games such as backgammon were also popular during this 
period. 

Given this rich tapestry of games and the obsession with gambling that char- 
acterized so much of the Western world, it may seem more than a little puzzling 
that a formal study of probability was not undertaken sooner than it was. As we 
will see shortly, the first instance of anyone conceptualizing probability in terms 
of a mathematical model occurred in the sixteenth century. That means that more 
than 2000 years of dice games, card games, and board games passed by before 
someone finally had the insight to write down even the simplest of probabilistic 
abstractions. 

Historians generally agree that, as a subject, probability got off to a rocky start 
because of its incompatibility with two of the most dominant forces in the evolution 
of our Western culture, Greek philosophy and early Christian theology. The Greeks 
were comfortable with the notion of chance (something the Christians were not), 
but it went against their nature to suppose that random events could be quantified in 
any useful fashion. They believed that any attempt to reconcile mathematically what 
did happen with what should have happened was, in their phraseology, an improper 
juxtaposition of the “earthly plane” with the “heavenly plane.” 

Making matters worse was the antiempiricism that permeated Greek thinking. 
Knowledge, to them, was not something that should be derived by experimentation. 
It was better to reason out a question logically than to search for its explanation in a 
set of numerical observations. Together, these two attitudes had a deadening effect: 
The Greeks had no motivation to think about probability in any abstract sense, nor 
were they faced with the problems of interpreting data that might have pointed them 
in the direction of a probability calculus. 

If the prospects for the study of probability were dim under the Greeks, they 
became even worse when Christianity broadened its sphere of influence. The Greeks 
and Romans at least accepted the existence of chance. However, they believed their 
gods to be either unable or unwilling to get involved in matters so mundane as the 
outcome of the roll of a die. Cicero writes: 


Nothing is so uncertain as a cast of dice, and yet there is no one who plays often who 
does not make a Venus-throw¹ and occasionally twice and thrice in succession. Then 
are we, like fools, to prefer to say that it happened by the direction of Venus rather 
than by chance? 


For the early Christians, though, there was no such thing as chance: Every event 
that happened, no matter how trivial, was perceived to be a direct manifestation of 
God’s deliberate intervention. In the words of St. Augustine: 


Nos eas causas quae dicuntur fortuitae ... non dicimus 
nullas, sed latentes; easque tribuimus vel veri Dei... 
(We say that those causes that are said to be by chance 
are not non-existent but are hidden, and we attribute 
them to the will of the true God...) 


Taking Augustine’s position makes the study of probability moot, and it makes 
a probabilist a heretic. Not surprisingly, nothing of significance was accomplished 
in the subject for the next fifteen hundred years. 

It was in the sixteenth century that probability, like a mathematical Lazarus, 
arose from the dead. Orchestrating its resurrection was one of the most eccentric 
figures in the entire history of mathematics, Gerolamo Cardano. By his own admis- 
sion, Cardano personified the best and the worst—the Jekyll and the Hyde—of 
the Renaissance man. He was born in 1501 in Pavia. Facts about his personal life 
are difficult to verify. He wrote an autobiography, but his penchant for lying raises 
doubts about much of what he says. Whether true or not, though, his “one-sentence” 
self-assessment paints an interesting portrait (127): 


Nature has made me capable in all manual work, it has given me the spirit of a 
philosopher and ability in the sciences, taste and good manners, voluptuousness, 
gaiety, it has made me pious, faithful, fond of wisdom, meditative, inventive, coura- 
geous, fond of learning and teaching, eager to equal the best, to discover new 
things and make independent progress, of modest character, a student of medicine, 
interested in curiosities and discoveries, cunning, crafty, sarcastic, an initiate in the 
mysterious lore, industrious, diligent, ingenious, living only from day to day, imper- 
tinent, contemptuous of religion, grudging, envious, sad, treacherous, magician and 
sorcerer, miserable, hateful, lascivious, obscene, lying, obsequious, fond of the prat- 
tle of old men, changeable, irresolute, indecent, fond of women, quarrelsome, and 
because of the conflicts between my nature and soul I am not understood even by 
those with whom I associate most frequently. 


Cardano was formally trained in medicine, but his interest in probability derived from his 
addiction to gambling. His love of dice and cards was so all-consuming that he is 
said to have once sold all his wife’s possessions just to get table stakes! Fortunately, 
something positive came out of Cardano’s obsession. He began looking for a math- 
ematical model that would describe, in some abstract way, the outcome of a random 
event. What he eventually formalized is now called the classical definition of prob- 
ability: If the total number of possible outcomes, all equally likely, associated with 
some action is n, and if m of those n result in the occurrence of some given event, 
then the probability of that event is m/n. If a fair die is rolled, there are n = 6 possible 
outcomes. If the event “Outcome is greater than or equal to 5” is the one in 
which we are interested, then m = 2 (the outcomes 5 and 6) and the probability of 
the event is 2/6, or 1/3 (see Figure 1.3.2). 


Figure 1.3.2 (the six possible outcomes of rolling a die; the outcomes greater than or equal to 5, namely 5 and 6, have probability = 2/6) 


1 When rolling four astragali, each of which is numbered on four sides, a Venus-throw was having each of the 
four numbers appear. 

Cardano had tapped into the most basic principle in probability. The model 
he discovered may seem trivial in retrospect, but it represented a giant step forward: 
His was the first recorded instance of anyone computing a theoretical, as opposed to 
an empirical, probability. Still, the actual impact of Cardano’s work was minimal. 
He wrote a book in 1525, but its publication was delayed until 1663. By then, the 
focus of the Renaissance, as well as interest in probability, had shifted from Italy to 
France. 
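
As a brief illustration of Cardano's classical definition, the following Python sketch counts the favorable and the possible outcomes and returns the ratio m/n as an exact fraction. The function name classical_probability and its arguments are our own choices for illustration, and the calculation assumes, as the definition requires, that the outcomes are finite in number and equally likely.

```python
from fractions import Fraction


def classical_probability(outcomes, event):
    """Return m/n for a finite collection of equally likely outcomes.

    `outcomes` lists every possible outcome; `event` is a function that
    returns True for the outcomes belonging to the event of interest.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one possible outcome")
    m = sum(1 for o in outcomes if event(o))  # outcomes in the event
    n = len(outcomes)                         # all possible outcomes
    return Fraction(m, n)


die = range(1, 7)  # faces of a fair six-sided die
print(classical_probability(die, lambda face: face >= 5))  # 1/3
```

Applied to the event "outcome is even," the same function counts m = 3 of the n = 6 faces.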

The date cited by many historians (those who are not Cardano supporters) as 
the “beginning” of probability is 1654. In Paris a well-to-do gambler, the Chevalier 
de Méré, asked several prominent mathematicians, including Blaise Pascal, a series 
of questions, the best known of which is the problem of points: 


Two people, A and B, agree to play a series of fair games until one person has won 
six games. They each have wagered the same amount of money, the intention being 
that the winner will be awarded the entire pot. But suppose, for whatever reason, 
the series is prematurely terminated, at which point A has won five games and B 
three. How should the stakes be divided? 


[The correct answer is that A should receive seven-eighths of the total amount 
wagered. (Hint: Suppose the contest were resumed. What scenarios would lead to 
A’s being the first person to win six games?)] 
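
The hint can be followed mechanically. The Python sketch below rests on two assumptions of ours that go slightly beyond the wording of the problem: successive games are independent, and the pot is divided in proportion to each player's chance of eventually winning. The hypothetical function share_for_a computes the probability that A reaches six wins first, given how many more wins each player still needs.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def share_for_a(a_needs, b_needs):
    """Probability that A wins the series when every game is fair (1/2 each)."""
    if a_needs == 0:   # A has already reached the target
        return Fraction(1)
    if b_needs == 0:   # B has already reached the target
        return Fraction(0)
    half = Fraction(1, 2)
    # Condition on the next game: A wins it, or B wins it.
    return (half * share_for_a(a_needs - 1, b_needs)
            + half * share_for_a(a_needs, b_needs - 1))


target = 6
a_wins, b_wins = 5, 3
print(share_for_a(target - a_wins, target - b_wins))  # 7/8
```

B can take the pot only by winning the next three games in a row, which has probability 1/8; A's share is the remaining 7/8.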

Pascal was intrigued by de Méré’s questions and shared his thoughts with Pierre 
Fermat, a Toulouse civil servant and probably the most brilliant mathematician in 
Europe. Fermat graciously replied, and from the now-famous Pascal-Fermat corre- 
spondence came not only the solution to the problem of points but the foundation 
for more general results. More significantly, news of what Pascal and Fermat were 
working on spread quickly. Others got involved, of whom the best known was the 
Dutch scientist and mathematician Christiaan Huygens. The delays and the indif- 
ference that had plagued Cardano a century earlier were not going to happen 
again. 

Best remembered for his work in optics and astronomy, Huygens, early in his 
career, was intrigued by the problem of points. In 1657 he published De Ratiociniis 
in Aleae Ludo (Calculations in Games of Chance), a very significant work, far more 
comprehensive than anything Pascal and Fermat had done. For almost fifty years it 
was the standard “textbook” in the theory of probability. Not surprisingly, Huygens 
has supporters who feel that he should be credited as the founder of probability. 

Almost all the mathematics of probability was still waiting to be discovered. 
What Huygens wrote was only the humblest of beginnings, a set of fourteen propo- 
sitions bearing little resemblance to the topics we teach today. But the foundation 
was there. The mathematics of probability was finally on firm footing. 




Statistics: From Aristotle to Quetelet 


Historians generally agree that the basic principles of statistical reasoning began 
to coalesce in the middle of the nineteenth century. What triggered this emergence 
was the union of three different “sciences,” each of which had been developing along 
more or less independent lines (195). 

The first of these sciences, what the Germans called Staatenkunde, involved 
the collection of comparative information on the history, resources, and military 
prowess of nations. Although efforts in this direction peaked in the seventeenth 
and eighteenth centuries, the concept was hardly new: Aristotle had done some- 
thing similar in the fourth century B.C. Of the three movements, this one had the 
least influence on the development of modern statistics, but it did contribute some 
terminology: The word statistics, itself, first arose in connection with studies of 
this type. 

The second movement, known as political arithmetic, was defined by one of 
its early proponents as “the art of reasoning by figures, upon things relating to 
government.” Of more recent vintage than Staatenkunde, political arithmetic’s roots 
were in seventeenth-century England. Making population estimates and construct- 
ing mortality tables were two of the problems it frequently dealt with. In spirit, 
political arithmetic was similar to what is now called demography. 

The third component was the development of a calculus of probability. As we 
saw earlier, this was a movement that essentially started in seventeenth-century 
France in response to certain gambling questions, but it quickly became the “engine” 
for analyzing all kinds of data. 


Staatenkunde: The Comparative Description of States 


The need for gathering information on the customs and resources of nations has 
been obvious since antiquity. Aristotle is credited with the first major effort toward 
that objective: His Politeiai, written in the fourth century B.C., contained detailed 
descriptions of some 158 different city-states. Unfortunately, the thirst for knowledge 
that led to the Politeiai fell victim to the intellectual drought of the Dark Ages, 
and almost two thousand years elapsed before any project of like magnitude 
was undertaken. 

The subject resurfaced during the Renaissance, and the Germans showed the 
most interest. They not only gave it a name, Staatenkunde, meaning “the compara- 
tive description of states,” but they were also the first (in 1660) to incorporate the 
subject into a university curriculum. A leading figure in the German movement was 
Gottfried Achenwall, who taught at the University of Göttingen during the middle 
of the eighteenth century. Among Achenwall’s claims to fame is that he was the first 
to use the word statistics in print. It appeared in the preface of his 1749 book Abriss 
der Statswissenschaft der heutigen vornehmsten europaishen Reiche und Republiken. 
(The word statistics comes from the Italian root stato, meaning “state,” implying 
that a statistician is someone concerned with government affairs.) As terminology, 
it seems to have been well-received: For almost one hundred years the word statistics 
continued to be associated with the comparative description of states. In the middle 
of the nineteenth century, though, the term was redefined, and statistics became the 
new name for what had previously been called political arithmetic. 

How important was the work of Achenwall and his predecessors to the devel- 
opment of statistics? That would be difficult to say. To be sure, their contributions 
were more indirect than direct. They left no methodology and no general theory. But 
they did point out the need for collecting accurate data and, perhaps more impor- 
tantly, reinforced the notion that something complex—even as complex as an entire 
nation—can be effectively studied by gathering information on its component parts. 
Thus, they were lending important support to the then-growing belief that induction, 
rather than deduction, was a more sure-footed path to scientific truth. 


Political Arithmetic 


In the sixteenth century the English government began to compile records, called 
bills of mortality, on a parish-to-parish basis, showing numbers of deaths and their 
underlying causes. Their motivation largely stemmed from the plague epidemics that 
had periodically ravaged Europe in the not-too-distant past and were threatening to 
become a problem in England. Certain government officials, including the very influ- 
ential Thomas Cromwell, felt that these bills would prove invaluable in helping to 
control the spread of an epidemic. At first, the bills were published only occasionally, 
but by the early seventeenth century they had become a weekly institution.² 

Figure 1.3.3 shows a portion of a bill that appeared in London 
in 1665. The gravity of the plague epidemic is strikingly apparent when we look at 
the numbers at the top: Out of 97,306 deaths, 68,596 (over 70%) were caused by 
the plague. The breakdown of certain other afflictions, though they caused fewer 
deaths, raises some interesting questions. What happened, for example, to the 23 
people who were “frighted” or to the 397 who suffered from “rising of the lights”? 

Among the faithful subscribers to the bills was John Graunt, a London mer- 
chant. Graunt not only read the bills, he studied them intently. He looked for 
patterns, computed death rates, devised ways of estimating population sizes, and 
even set up a primitive life table. His results were published in the 1662 treatise 
Natural and Political Observations upon the Bills of Mortality. This work was a land- 
mark: Graunt had launched the twin sciences of vital statistics and demography, and, 
although the name came later, it also signaled the beginning of political arithmetic. 
(Graunt did not have to wait long for accolades; in the year his book was published, 
he was elected to the prestigious Royal Society of London.) 

High on the list of innovations that made Graunt’s work unique were his objec- 
tives. Not content simply to describe a situation, although he was adept at doing so, 
Graunt often sought to go beyond his data and make generalizations (or, in current 
statistical terminology, draw inferences). Having been blessed with this particular 
turn of mind, he almost certainly qualifies as the world’s first statistician. All Graunt 
really lacked was the probability theory that would have enabled him to frame his 
inferences more mathematically. That theory, though, was just beginning to unfold 
several hundred miles away in France (151). 

Other seventeenth-century writers were quick to follow through on Graunt’s 
ideas. William Petty’s Political Arithmetick was published in 1690, although it had 
probably been written some fifteen years earlier. (It was Petty who gave the move- 
ment its name.) Perhaps even more significant were the contributions of Edmund 
Halley (of “Halley’s comet” fame). Principally an astronomer, he also dabbled in 
political arithmetic, and in 1693 wrote An Estimate of the Degrees of the Mortal- 
ity of Mankind, drawn from Curious Tables of the Births and Funerals at the city of 
Breslaw; with an attempt to ascertain the Price of Annuities upon Lives. (Book titles 


2 An interesting account of the bills of mortality is given in Daniel Defoe’s A Journal of the Plague Year, which 
purportedly chronicles the London plague outbreak of 1665. 


The bill for the year—A General Bill for this present year, ending the 19 of 
December, 1665, according to the Report made to the King’s most excellent 
Majesty, by the Co. of Parish Clerks of Lond., &c.—gives the following sum- 
mary of the results; the details of the several parishes we omit, they being made 
as in 1625, except that the out-parishes were now 12:— 

Buried in the 27 Parishes within the walls ............................. 15,207 
Whereof of the plague ..................................................  9,887 
Buried in the 16 Parishes without the walls ............................ 41,351 
Whereof of the plague .................................................. 28,838 
At the Pesthouse, total buried .........................................    159 
Of the plague ..........................................................    156 
Buried in the 12 out-Parishes in Middlesex and Surrey .................. 18,554 
Whereof of the plague .................................................. 21,420 
Buried in the 5 Parishes in the City and Liberties of Westminster ...... 12,194 
Whereof the plague .....................................................  8,403 
The total of all the christenings ......................................  9,967 
The total of all the burials this year ................................. 97,306 
Whereof of the plague .................................................. 68,596 

Abortive and Stillborne .............    617  Griping in the Guts ..............  1,288  Palsie .....................     30 
Aged ................................  1,545  Hang’d & made away themselves ....    [?]  Plague ..................... 68,596 
Ague & Feaver .......................  5,257  Headmould shot and mould fallen ..     14  Plannet ....................      6 
Appolex and Suddenly ................    116  Jaundice .........................    110  Plurisie ...................     15 
Bedrid ..............................     10  Impostume ........................    227  Poysoned ...................      1 
Blasted .............................      5  Kill by several accidents ........     46  Quinsie ....................     35 
Bleeding ............................     16  King’s Evil ......................     86  Rickets ....................    535 
Cold & Cough ........................     68  Leprosie .........................      2  Rising of the Lights .......    397 
Collick & Winde .....................    134  Lethargy .........................     14  Rupture ....................     34 
Consumption & Tissick ...............  4,808  Livergrown .......................     20  Scurvy .....................    105 
Convulsion & Mother .................  2,036  Bloody Flux, Scowring & Flux .....     18  Shingles & Swine Pox .......      2 
Distracted ..........................      5  Burnt and Scalded ................      8  Sores, Ulcers, Broken and 
Dropsie & Timpany ...................  1,478  Calenture ........................      3    Bruised Limbs ............     82 
Drowned .............................     50  Cancer, Cangrene & Fistula .......     56  Spleen .....................     14 
Executed ............................     21  Canker and Thrush ................    111  Spotted Feaver & Purples ...  1,929 
Flox & Smallpox .....................    655  Childbed .........................    625  Stopping of the Stomach ....    332 
Found Dead in streets, fields, &c. ..     20  Chrisomes and Infants ............  1,258  Stone and Stranguary .......     98 
French Pox ..........................     86  Meagrom and Headach ..............     12  Surfet .....................  1,251 
Frighted ............................     23  Measles ..........................      7  Teeth & Worms ..............  2,614 
Gout & Sciatica .....................     27  Murthered & Shot .................      9  Vomiting ...................     51 
Grief ...............................     46  Overlaid & Starved ...............     45  Wenn .......................      8 

Christened-Males ....................  5,114  Females ..........................  4,853  In all .....................  9,967 
Buried-Males ........................ 58,569  Females .......................... 48,737  In all ..................... 97,306 
Of the Plague ....................................................................................................... 68,596 
Increase in the Burials in the 130 Parishes and the Pesthouse this year ............................................. 79,009 
Increase of the Plague in the 130 Parishes and the Pesthouse this year .............................................. 68,590 


Figure 1.3.3 


were longer then!) Halley shored up, mathematically, the efforts of Graunt and oth- 
ers to construct an accurate mortality table. In doing so, he laid the foundation for 
the important theory of annuities. Today, all life insurance companies base their pre- 
mium schedules on methods similar to Halley’s. (The first company to follow his lead 
was The Equitable, founded in 1765.) 

For all its initial flurry of activity, political arithmetic did not fare particularly 
well in the eighteenth century, at least in terms of having its methodology fine-tuned. 
Still, the second half of the century did see some notable achievements in improving 
the quality of the databases: Several countries, including the United States in 1790, 
established a periodic census. To some extent, answers to the questions that inter- 
ested Graunt and his followers had to be deferred until the theory of probability 
could develop just a little bit more. 


Quetelet: The Catalyst 


With political arithmetic furnishing the data and many of the questions, and the the- 
ory of probability holding out the promise of rigorous answers, the birth of statistics 
was at hand. All that was needed was a catalyst—someone to bring the two together. 
Several individuals served with distinction in that capacity. Carl Friedrich Gauss, the 
superb German mathematician and astronomer, was especially helpful in showing 
how statistical concepts could be useful in the physical sciences. Similar efforts in 
France were made by Laplace. But the man who perhaps best deserves the title of 
“matchmaker” was a Belgian, Adolphe Quetelet. 

Quetelet was a mathematician, astronomer, physicist, sociologist, anthropolo- 
gist, and poet. One of his passions was collecting data, and he was fascinated by the 
regularity of social phenomena. In commenting on the nature of criminal tendencies, 
he once wrote (70): 


Thus we pass from one year to another with the sad perspective of seeing the same 
crimes reproduced in the same order and calling down the same punishments in the 
same proportions. Sad condition of humanity! ...We might enumerate in advance 
how many individuals will stain their hands in the blood of their fellows, how many 
will be forgers, how many will be poisoners, almost we can enumerate in advance the 
births and deaths that should occur. There is a budget which we pay with a frightful 
regularity; it is that of prisons, chains and the scaffold. 


Given such an orientation, it was not surprising that Quetelet would see in prob- 
ability theory an elegant means for expressing human behavior. For much of the 
nineteenth century he vigorously championed the cause of statistics, and as a mem- 
ber of more than one hundred learned societies, his influence was enormous. When 
he died in 1874, statistics had been brought to the brink of its modern era. 


1.4 A Chapter Summary 


The concepts of probability lie at the very heart of all statistical problems. Acknowl- 
edging that fact, the next two chapters take a close look at some of those concepts. 
Chapter 2 states the axioms of probability and investigates their consequences. It 
also covers the basic skills for algebraically manipulating probabilities and gives an 
introduction to combinatorics, the mathematics of counting. Chapter 3 reformulates 
much of the material in Chapter 2 in terms of random variables, the latter being a 
concept of great convenience in applying probability to statistics. Over the years, 
particular measures of probability have emerged as being especially useful: The 
most prominent of these are profiled in Chapter 4. 

Our study of statistics proper begins with Chapter 5, which is a first look at 
the theory of parameter estimation. Chapter 6 introduces the notion of hypothesis 
testing, a procedure that, in one form or another, commands a major share of the 
remainder of the book. From a conceptual standpoint, these are very important 
chapters: Most formal applications of statistical methodology will involve either 
parameter estimation or hypothesis testing, or both. 

Among the probability functions featured in Chapter 4, the normal distribu- 
tion—more familiarly known as the bell-shaped curve—is sufficiently important to 
merit even further scrutiny. Chapter 7 derives in some detail many of the properties 
and applications of the normal distribution as well as those of several related prob- 
ability functions. Much of the theory that supports the methodology appearing in 
Chapters 9 through 13 comes from Chapter 7. 

Chapter 8 describes some of the basic principles of experimental “design.” 
Its purpose is to provide a framework for comparing and contrasting the various 
statistical procedures profiled in Chapters 9 through 14. 

Chapters 9, 12, and 13 continue the work of Chapter 7, but with the emphasis 
on the comparison of several populations, similar to what was done in Case Study 
1.2.2. Chapter 10 looks at the important problem of assessing the level of agreement 
between a set of data and the values predicted by the probability model from which 
those data presumably came. Linear relationships are examined in Chapter 11. 

Chapter 14 is an introduction to nonparametric statistics. The objective there is 
to develop procedures for answering some of the same sorts of questions raised in 
Chapters 7, 9, 12, and 13, but with fewer initial assumptions. 

As a general format, each chapter contains numerous examples and case stud- 
ies, the latter including actual experimental data taken from a variety of sources, 
primarily newspapers, magazines, and technical journals. We hope that these appli- 
cations will make it abundantly clear that, while the general orientation of this text 
is theoretical, the consequences of that theory are never too far from having direct 
relevance to the “real world.” 


One of the most influential of seventeenth-century mathematicians, Fermat earned his 
living as a lawyer and administrator in Toulouse. He shares credit with Descartes for 
the invention of analytic geometry, but his most important work may have been in 
number theory. Fermat did not write for publication, preferring instead to send letters 
and papers to friends. His correspondence with Pascal was the starting point for the 
development of a mathematical theory of probability. 

— Pierre de Fermat (1601-1665) 


Pascal was the son of a nobleman. A prodigy of sorts, he had already published a 
treatise on conic sections by the age of sixteen. He also invented one of the early 
calculating machines to help his father with accounting work. Pascal’s contributions 
to probability were stimulated by his correspondence, in 1654, with Fermat. Later 
that year he retired to a life of religious meditation. 

—Blaise Pascal (1623-1662) 


2.1 Introduction 


Experts have estimated that the likelihood of any given UFO sighting being genuine 
is on the order of one in one hundred thousand. Since the early 1950s, some ten 
thousand sightings have been reported to civil authorities. What is the probability 
that at least one of those objects was, in fact, an alien spacecraft? In 1978, Pete Rose 
of the Cincinnati Reds set a National League record by batting safely in forty-four 
consecutive games. How unlikely was that event, given that Rose was a lifetime 
.303 hitter? By definition, the mean free path is the average distance a molecule in a 
gas travels before colliding with another molecule. How likely is it that the distance a 
molecule travels between collisions will be at least twice its mean free path? Suppose 
a boy’s mother and father both have genetic markers for sickle cell anemia, but 
neither parent exhibits any of the disease’s symptoms. What are the chances that 
their son will also be asymptomatic? What are the odds that a poker player is dealt 
a full house or that a craps-shooter makes his “point”? If a woman has lived to 
age seventy, how likely is it that she will die before her ninetieth birthday? In 1994, 
Tom Foley was Speaker of the House and running for re-election. The day after the 
election, his race had still not been “called” by any of the networks: he trailed his 
Republican challenger by 2174 votes, but 14,000 absentee ballots remained to be 
counted. Foley, however, conceded. Should he have waited for the absentee ballots 
to be counted, or was his defeat at that point a virtual certainty? 

As the nature and variety of these questions would suggest, probability is a sub- 
ject with an extraordinary range of real-world, everyday applications. What began 
as an exercise in understanding games of chance has proven to be useful every- 
where. Maybe even more remarkable is the fact that the solutions to all of these 
diverse questions are rooted in just a handful of definitions and theorems. Those 
results, together with the problem-solving techniques they empower, are the sum 
and substance of Chapter 2. We begin, though, with a bit of history. 


The Evolution of the Definition of Probability 


Over the years, the definition of probability has undergone several revisions. There 
is nothing contradictory in the multiple definitions—the changes primarily reflected 
the need for greater generality and more mathematical rigor. The first formulation 
(often referred to as the classical definition of probability) is credited to Gerolamo 
Cardano (recall Section 1.3). It applies only to situations where (1) the number of 
possible outcomes is finite and (2) all outcomes are equally likely. Under those con- 
ditions, the probability of an event comprised of m outcomes is the ratio m/n, where 
n is the total number of (equally likely) outcomes. Tossing a fair, six-sided die, for 
example, gives m/n = 3/6 as the probability of rolling an even number (that is, either 
2, 4, or 6). 

While Cardano’s model was well-suited to gambling scenarios (for which it was 
intended), it was obviously inadequate for more general problems, where outcomes 
are not equally likely and/or the number of outcomes is not finite. Richard von 
Mises, a twentieth-century German mathematician, is often credited with avoid- 
ing the weaknesses in Cardano’s model by defining “empirical” probabilities. In the 
von Mises approach, we imagine an experiment being repeated over and over again 
under presumably identical conditions. Theoretically, a running tally could be kept 
of the number of times (m) the outcome belongs to a given event divided by n, the 
total number of times the experiment is performed. According to von Mises, the 
probability of the given event is the limit (as n goes to infinity) of the ratio m/n. 
Figure 2.1.1 illustrates the empirical probability of getting a head by tossing a fair 
coin: as the number of tosses continues to increase, the ratio m/n converges to 1/2. 


Figure 2.1.1 (plot of the ratio m/n against n, the number of trials; m/n settles toward lim m/n as n → ∞)
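The limiting behavior of m/n can be imitated numerically. The following is only a sketch under stated assumptions: the function name `running_ratio`, the seed, and the number of tosses are illustrative choices, and a pseudorandom generator stands in for a physical coin.

```python
import random

def running_ratio(num_tosses, seed=0):
    """Return the ratio m/n after each of num_tosses simulated fair-coin tosses."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    heads = 0
    ratios = []
    for n in range(1, num_tosses + 1):
        heads += rng.random() < 0.5  # True counts as one head
        ratios.append(heads / n)
    return ratios

ratios = running_ratio(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(ratios[n - 1], 4))
```

No finite run can establish the limit, of course, which is exactly the objection raised against the empirical definition.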


The von Mises approach definitely shores up some of the inadequacies seen in 
the Cardano model, but it is not without shortcomings of its own. There is some 
conceptual inconsistency, for example, in extolling the limit of m/n as a way of defin- 
ing a probability empirically, when the very act of repeating an experiment under 
identical conditions an infinite number of times is physically impossible. And left 
unanswered is the question of how large n must be in order for m/n to be a good 
approximation for lim m/n. 

Andrei Kolmogorov, the great Russian probabilist, took a different approach. 
Aware that many twentieth-century mathematicians were having success developing 
subjects axiomatically, Kolmogorov wondered whether probability might similarly 
be defined operationally, rather than as a ratio (like the Cardano model) or as a 
limit (like the von Mises model). His efforts culminated in a masterpiece of mathe- 
matical elegance when he published Grundbegriffe der Wahrscheinlichkeitsrechnung 
(Foundations of the Theory of Probability) in 1933. In essence, Kolmogorov was able 
to show that a maximum of four simple axioms is necessary and sufficient to define 
the way any and all probabilities must behave. (These will be our starting point in 
Section 2.3.) 

We begin Chapter 2 with some basic (and, presumably, familiar) definitions 
from set theory. These are important because probability will eventually be defined 
as a set function—that is, a mapping from a set to a number. Then, with the help 
of Kolmogorov’s axioms in Section 2.3, we will learn how to calculate and manipulate 
probabilities. The chapter concludes with an introduction to combinatorics (the 
mathematics of systematic counting) and its application to probability. 


2.2 Sample Spaces and the Algebra of Sets 


The starting point for studying probability is the definition of four key terms: exper- 
iment, sample outcome, sample space, and event. The latter three, all carryovers 
from classical set theory, give us a familiar mathematical framework within which to 
work; the former is what provides the conceptual mechanism for casting real-world 
phenomena into probabilistic terms. 

By an experiment we will mean any procedure that (1) can be repeated, the- 
oretically, an infinite number of times; and (2) has a well-defined set of possible 
outcomes. Thus, rolling a pair of dice qualifies as an experiment; so does measuring 
a hypertensive’s blood pressure or doing a spectrographic analysis to determine the 
carbon content of moon rocks. Asking a would-be psychic to draw a picture of an 
image presumably transmitted by another would-be psychic does not qualify as an 
experiment, because the set of possible outcomes cannot be listed, characterized, or 
otherwise defined. 

Each of the potential eventualities of an experiment is referred to as a sample 
outcome, s, and their totality is called the sample space, S. To signify the membership 
of s in S, we write $s \in S$. Any designated collection of sample outcomes, including 
individual outcomes, the entire sample space, and the null set, constitutes an event. 
The latter is said to occur if the outcome of the experiment is one of the members 
of the event. 


Example 2.2.1 

Consider the experiment of flipping a coin three times. What is the sample space? 
Which sample outcomes make up the event A: Majority of coins show heads? 

Think of each sample outcome here as an ordered triple, its components representing 
the outcomes of the first, second, and third tosses, respectively. Altogether, 
there are eight different triples, so those eight comprise the sample space: 

S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT} 

By inspection, we see that four of the sample outcomes in S constitute the event A: 

A = {HHH, HHT, HTH, THH} 
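The listing of S and A for three coin tosses can be reproduced by brute-force enumeration. This is a minimal sketch; encoding each outcome as a three-letter string is an assumption made here for convenience.

```python
from itertools import product

# Sample space: all ordered triples of H/T, one entry per toss.
S = {"".join(t) for t in product("HT", repeat=3)}

# Event A: a majority (at least two) of the three tosses show heads.
A = {s for s in S if s.count("H") >= 2}

print(sorted(S))
print(sorted(A))
```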


Example 2.2.2 

Imagine rolling two dice, the first one red, the second one green. Each sample outcome 
is an ordered pair (face showing on red die, face showing on green die), and 
the entire sample space can be represented as a 6 × 6 matrix (see Figure 2.2.1). 

                              Face showing on green die 
                        1       2       3       4       5       6 
                   1  (1, 1)  (1, 2)  (1, 3)  (1, 4)  (1, 5)  (1, 6) 
                   2  (2, 1)  (2, 2)  (2, 3)  (2, 4)  (2, 5)  (2, 6) 
Face showing       3  (3, 1)  (3, 2)  (3, 3)  (3, 4)  (3, 5)  (3, 6) 
on red die         4  (4, 1)  (4, 2)  (4, 3)  (4, 4)  (4, 5)  (4, 6) 
                   5  (5, 1)  (5, 2)  (5, 3)  (5, 4)  (5, 5)  (5, 6) 
                   6  (6, 1)  (6, 2)  (6, 3)  (6, 4)  (6, 5)  (6, 6) 

Figure 2.2.1 

Gamblers are often interested in the event A that the sum of the faces showing 
is a 7. Notice in Figure 2.2.1 that the sample outcomes contained in A are the six 
diagonal entries, (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), and (6, 1). 
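The two-dice sample space and the event "sum is 7" are small enough to build directly. The sketch below assumes each outcome is stored as a (red, green) tuple.

```python
from itertools import product

# Sample space: ordered pairs (red face, green face).
S = set(product(range(1, 7), repeat=2))

# Event A: the two faces sum to 7.
A = {(red, green) for red, green in S if red + green == 7}

print(len(S))
print(sorted(A))
```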


Example 2.2.3 

A local TV station advertises two newscasting positions. If three women ($W_1$, $W_2$, 
$W_3$) and two men ($M_1$, $M_2$) apply, the “experiment” of hiring two coanchors 
generates a sample space of ten outcomes: 

$$S = \{(W_1, W_2), (W_1, W_3), (W_2, W_3), (W_1, M_1), (W_1, M_2), (W_2, M_1), 
(W_2, M_2), (W_3, M_1), (W_3, M_2), (M_1, M_2)\}$$ 

Does it matter here that the two positions being filled are equivalent? Yes. If the 
station were seeking to hire, say, a sports announcer and a weather forecaster, 
the number of possible outcomes would be twenty: $(W_2, M_1)$, for example, would 
represent a different staffing assignment than $(M_1, W_2)$. 
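The difference between equivalent and distinct positions is the difference between unordered and ordered selections. A short sketch, in which the applicant labels are simply strings chosen for illustration:

```python
from itertools import combinations, permutations

applicants = ["W1", "W2", "W3", "M1", "M2"]

# Equivalent positions: order within a pair is irrelevant.
coanchors = list(combinations(applicants, 2))

# Distinct positions (say, sports and weather): order matters.
distinct_roles = list(permutations(applicants, 2))

print(len(coanchors), len(distinct_roles))
```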


Example 2.2.4 

The number of sample outcomes associated with an experiment need not be 
finite. Suppose that a coin is tossed until the first tail appears. If the first toss is 
itself a tail, the outcome of the experiment is T; if the first tail occurs on the second 
toss, the outcome is HT; and so on. Theoretically, of course, the first tail may never 
occur, and the infinite nature of S is readily apparent: 

S = {T, HT, HHT, HHHT, ...} 
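An infinite sample space cannot be listed in full, but its structure can be described by a rule that produces outcomes on demand. The generator below is a sketch of that idea; the name `first_tail_outcomes` is hypothetical.

```python
from itertools import count, islice

def first_tail_outcomes():
    """Yield T, HT, HHT, ...: k heads followed by the first tail, k = 0, 1, 2, ..."""
    for k in count():
        yield "H" * k + "T"

# S is infinite, so only a finite prefix can ever be materialized.
print(list(islice(first_tail_outcomes(), 4)))
```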


There are three ways to indicate an experiment’s sample space. If the number of pos- 
sible outcomes is small, we can simply list them, as we did in Examples 2.2.1 through 
2.2.3. In some cases it may be possible to characterize a sample space by showing the 
structure its outcomes necessarily possess. This is what we did in Example 2.2.4. 
A third option is to state a mathematical formula that the sample outcomes must 
satisfy. 

Example 2.2.5 

A computer programmer is running a subroutine that solves a general 
quadratic equation, $ax^2 + bx + c = 0$. Her “experiment” consists of choosing values 
for the three coefficients a, b, and c. Define (1) S and (2) the event A: Equation 
has two equal roots. 

First, we must determine the sample space. Since presumably no combinations 
of finite a, b, and c are inadmissible, we can characterize S by writing a series of 
inequalities: 

$$S = \{(a, b, c) : -\infty < a < \infty,\ -\infty < b < \infty,\ -\infty < c < \infty\}$$ 

Defining A requires the well-known result from algebra that a quadratic equation 
has equal roots if and only if its discriminant, $b^2 - 4ac$, vanishes. Membership in A, 
then, is contingent on a, b, and c satisfying an equation: 

$$A = \{(a, b, c) : b^2 - 4ac = 0\}$$ 
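When an event is defined by a formula, membership reduces to checking the formula. The sketch below tests the condition $b^2 - 4ac = 0$ exactly as stated; the function name `in_A` is hypothetical, and exact integer or rational inputs are assumed so that round-off cannot blur the equality.

```python
from fractions import Fraction

def in_A(a, b, c):
    """True when (a, b, c) satisfies b^2 - 4ac = 0, the defining condition for A."""
    return b * b - 4 * a * c == 0

# Exact arithmetic (ints or Fractions) avoids floating-point round-off in the test.
print(in_A(1, 2, 1))                # x^2 + 2x + 1
print(in_A(1, 0, -1))               # x^2 - 1
print(in_A(Fraction(1, 4), 1, 1))   # x^2/4 + x + 1
```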


Questions 


2.2.1. A graduating engineer has signed up for three job 
interviews. She intends to categorize each one as being 
either a “success” or a “failure” depending on whether 
it leads to a plant trip. Write out the appropriate sam- 
ple space. What outcomes are in the event A: Second 
success occurs on third interview? In B: First success 
never occurs? (Hint: Notice the similarity between this 
situation and the coin-tossing experiment described in 
Example 2.2.1.) 


2.2.2. Three dice are tossed, one red, one blue, and one 
green. What outcomes make up the event A that the sum 
of the three faces showing equals 5? 


2.2.3. An urn contains six chips numbered 1 through 6. 
Three are drawn out. What outcomes are in the event 
“Second smallest chip is a 3”? Assume that the order of 
the chips is irrelevant. 


2.2.4. Suppose that two cards are dealt from a standard 
52-card poker deck. Let A be the event that the sum of 
the two cards is 8 (assume that aces have a numerical value 
of 1). How many outcomes are in A? 


2.2.5. In the lingo of craps-shooters (where two dice are 
tossed and the underlying sample space is the matrix pic- 
tured in Figure 2.2.1) is the phrase “making a hard eight.” 
What might that mean? 


2.2.6. A poker deck consists of fifty-two cards, represent- 
ing thirteen denominations (2 through ace) and four suits 
(diamonds, hearts, clubs, and spades). A five-card hand is 
called a flush if all five cards are in the same suit but not all 
five denominations are consecutive. Pictured in the next 
column is a flush in hearts. Let N be the set of five cards in 
hearts that are not flushes. How many outcomes are in N? 


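[Note: In poker, the denominations (A, 2, 3, 4, 5) are considered 
to be consecutive (in addition to sequences such 
as (8, 9, 10, J, Q)).] 

[Figure: grid of suits (rows D, H, C, S) against denominations (columns 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A), with five X's in the H row marking a flush in hearts] 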


2.2.7. Let P be the set of right triangles with a 5” 
hypotenuse and whose height and length are a and b, 
respectively. Characterize the outcomes in P. 


2.2.8. Suppose a baseball player steps to the plate with 
the intention of trying to “coax” a base on balls by never 
swinging at a pitch. The umpire, of course, will necessar- 
ily call each pitch either a ball (B) or a strike (S). What 
outcomes make up the event A, that a batter walks on the 
sixth pitch? (Note: A batter “walks” if the fourth ball is 
called before the third strike.) 


2.2.9. A telemarketer is planning to set up a phone 
bank to bilk widows with a Ponzi scheme. His past expe- 
rience (prior to his most recent incarceration) suggests 
that each phone will be in use half the time. For a given 
phone at a given time, let 0 indicate that the phone is 
available and let 1 indicate that a caller is on the line. Sup- 
pose that the telemarketer’s “bank” is comprised of four 
telephones. 


(a) Write out the outcomes in the sample space. 

(b) What outcomes would make up the event that 
exactly two phones are being used? 

(c) Suppose the telemarketer had k phones. How many 
outcomes would allow for the possibility that at most 
one more call could be received? (Hint: How many 
lines would have to be busy?) 


2.2.10. Two darts are thrown at the following target: 


(a) Let (u, v) denote the outcome that the first dart lands 
in region u and the second dart, in region v. List the 
sample space of (u, v)’s. 

(b) List the outcomes in the sample space of sums, u + v. 


2.2.11. A woman has her purse snatched by two 
teenagers. She is subsequently shown a police lineup con- 
sisting of five suspects, including the two perpetrators. 
What is the sample space associated with the experiment 
“Woman picks two suspects out of lineup”? Which out- 
comes are in the event A: She makes at least one incorrect 
identification? 


2.2.12. Consider the experiment of choosing coefficients 
for the quadratic equation $ax^2 + bx + c = 0$. Characterize 
the values of a, b, and c associated with the event A: 
Equation has complex roots. 




2.2.13. In the game of craps, the person rolling the dice 
(the shooter) wins outright if his first toss is a 7 or an 11. 
If his first toss is a 2, 3, or 12, he loses outright. If his first 
roll is something else, say, a 9, that number becomes his 
“point” and he keeps rolling the dice until he either rolls 
another 9, in which case he wins, or a 7, in which case he 
loses. Characterize the sample outcomes contained in the 
event “Shooter wins with a point of 9.” 


2.2.14. A probability-minded despot offers a convicted 
murderer a final chance to gain his release. The prisoner 
is given twenty chips, ten white and ten black. All twenty 
are to be placed into two urns, according to any allo- 
cation scheme the prisoner wishes, with the one proviso 
being that each urn contain at least one chip. The execu- 
tioner will then pick one of the two urns at random and 
from that urn, one chip at random. If the chip selected is 
white, the prisoner will be set free; if it is black, he “buys 
the farm.” Characterize the sample space describing the 
prisoner’s possible allocation options. (Intuitively, which 
allocation affords the prisoner the greatest chance of 
survival?) 


2.2.15. Suppose that ten chips, numbered 1 through 10, 
are put into an urn at one minute to midnight, and chip 
number 1 is quickly removed. At one-half minute to mid- 
night, chips numbered 11 through 20 are added to the urn, 
and chip number 2 is quickly removed. Then at one-fourth 
minute to midnight, chips numbered 21 to 30 are added to 
the urn, and chip number 3 is quickly removed. If that pro- 
cedure for adding chips to the urn continues, how many 
chips will be in the urn at midnight (148)? 


Unions, Intersections, and Complements 


Associated with events defined on a sample space are several operations collectively 
referred to as the algebra of sets. These are the rules that govern the ways in which 
one event can be combined with another. Consider, for example, the game of craps 
described in Question 2.2.13. The shooter wins on his initial roll if he throws either 
a 7 or an 11. In the language of the algebra of sets, the event “Shooter rolls a 7 or 
an 11” is the union of two simpler events, “Shooter rolls a 7” and “Shooter rolls 
an 11.” If E denotes the union and if A and B denote the two events making up the 
union, we write $E = A \cup B$. The next several definitions and examples illustrate those 
portions of the algebra of sets that we will find particularly useful in the chapters 
ahead. 

Definition 2.2.1. Let A and B be any two events defined over the same sample 
space S. Then 

a. The intersection of A and B, written $A \cap B$, is the event whose outcomes 
belong to both A and B. 

b. The union of A and B, written $A \cup B$, is the event whose outcomes belong 
to either A or B or both. 
Example 2.2.6 

A single card is drawn from a poker deck. Let A be the event that an ace is selected: 

A = {ace of hearts, ace of diamonds, ace of clubs, ace of spades} 

Let B be the event “Heart is drawn”: 

B = {2 of hearts, 3 of hearts, ..., ace of hearts} 

Then 

$A \cap B$ = {ace of hearts} 

and 

$A \cup B$ = {2 of hearts, 3 of hearts, ..., ace of hearts, ace of diamonds, 
ace of clubs, ace of spades} 

(Let C be the event “Club is drawn.” Which cards are in $B \cup C$? In $B \cap C$?) 
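Intersections and unions of events map directly onto set operations in most programming languages. The sketch below rebuilds the ace and heart events from a 52-card deck; the (rank, suit) tuple encoding is an assumption made for illustration.

```python
ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "jack", "queen", "king", "ace"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = {(rank, suit) for rank in ranks for suit in suits}

A = {card for card in deck if card[0] == "ace"}      # an ace is selected
B = {card for card in deck if card[1] == "hearts"}   # a heart is drawn

print(A & B)        # intersection of A and B
print(len(A | B))   # number of cards in the union of A and B
```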


Example 2.2.7 

Let A be the set of x’s for which $x^2 + 2x = 8$; let B be the set for which $x^2 + x = 6$. 
Find $A \cap B$ and $A \cup B$. 

Since the first equation factors into $(x + 4)(x - 2) = 0$, its solution set is $A = \{-4, 2\}$. 
Similarly, the second equation can be written $(x + 3)(x - 2) = 0$, making 
$B = \{-3, 2\}$. Therefore, 

$$A \cap B = \{2\}$$ 

and 

$$A \cup B = \{-4, -3, 2\}$$ 
Example 2.2.8 

Consider the electrical circuit pictured in Figure 2.2.2. Let $A_i$ denote the event that 
switch i fails to close, i = 1, 2, 3, 4. Let A be the event “Circuit is not completed.” 
Express A in terms of the $A_i$’s. 

Figure 2.2.2 (circuit diagram with four numbered switches: 1 and 2 on line a, 3 and 4 on line b) 

Call the 1 and 2 switches line a; call the 3 and 4 switches line b. By inspection, 
the circuit fails only if both line a and line b fail. But line a fails only if either 1 or 
2 (or both) fail. That is, the event that line a fails is the union $A_1 \cup A_2$. Similarly, 
the failure of line b is the union $A_3 \cup A_4$. The event that the circuit fails, then, is an 
intersection: 

$$A = (A_1 \cup A_2) \cap (A_3 \cup A_4)$$ 
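The set expression for the circuit's failure can be checked against a direct model of the circuit by running through all sixteen switch configurations. This is a sketch: the function names are hypothetical, and the model assumes, as the example's reasoning implies, that a line conducts only when both of its switches close and that the circuit is completed when either line conducts.

```python
from itertools import product

def circuit_completed(closed):
    """closed[i] is True when switch i+1 closes; lines a (1, 2) and b (3, 4) are alternative paths."""
    line_a = closed[0] and closed[1]
    line_b = closed[2] and closed[3]
    return line_a or line_b

def in_event_A(closed):
    """Set expression (A1 u A2) n (A3 u A4), where Ai means switch i fails to close."""
    fails = [not c for c in closed]
    return (fails[0] or fails[1]) and (fails[2] or fails[3])

# The two descriptions agree on all 16 switch configurations.
agree = all(in_event_A(c) == (not circuit_completed(c))
            for c in product([True, False], repeat=4))
print(agree)
```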


Definition 2.2.2. Events A and B defined over the same sample space are said 
to be mutually exclusive if they have no outcomes in common, that is, if $A \cap B = \emptyset$, 
where $\emptyset$ is the null set. 

Example 2.2.9 

Consider a single throw of two dice. Define A to be the event that the sum of the 
faces showing is odd. Let B be the event that the two faces themselves are odd. Then 
clearly, the intersection is empty, the sum of two odd numbers necessarily being 
even. In symbols, $A \cap B = \emptyset$. (Recall the event $B \cap C$ asked for in Example 2.2.6.) 


Definition 2.2.3. Let A be any event defined on a sample space S. The complement 
of A, written $A^C$, is the event consisting of all the outcomes in S other 
than those contained in A. 


Example 2.2.10 

Let A be the set of (x, y)’s for which $x^2 + y^2 < 1$. Sketch the region in the xy-plane 
corresponding to $A^C$. 

From analytic geometry, we recognize that $x^2 + y^2 < 1$ describes the interior of 
a circle of radius 1 centered at the origin. Figure 2.2.3 shows the complement: the 
points on the circumference of the circle and the points outside the circle. 

Figure 2.2.3 (the xy-plane with the region $A^C$: $x^2 + y^2 \geq 1$ marked) 


The notions of union and intersection can easily be extended to more than 
two events. For example, the expression $A_1 \cup A_2 \cup \cdots \cup A_k$ defines the set of outcomes 
belonging to any of the $A_i$’s (or to any combination of the $A_i$’s). Similarly, 
$A_1 \cap A_2 \cap \cdots \cap A_k$ is the set of outcomes belonging to all of the $A_i$’s. 


Example 2.2.11 

Suppose the events $A_1, A_2, \ldots, A_k$ are intervals of real numbers such that 

$$A_i = \{x : 0 < x < 1/i\}, \quad i = 1, 2, \ldots, k$$ 

Describe the sets $A_1 \cup A_2 \cup \cdots \cup A_k = \bigcup_{i=1}^{k} A_i$ and $A_1 \cap A_2 \cap \cdots \cap A_k = \bigcap_{i=1}^{k} A_i$. 

Notice that the $A_i$’s are telescoping sets. That is, $A_1$ is the interval $0 < x < 1$, $A_2$ 
is the interval $0 < x < 1/2$, and so on. It follows, then, that the union of the k $A_i$’s is 
simply $A_1$ while the intersection of the $A_i$’s (that is, their overlap) is $A_k$. 


Questions 

2.2.16. Sketch the regions in the xy-plane corresponding 
to $A \cup B$ and $A \cap B$ if 

$$A = \{(x, y) : 0 < x < 3,\ 0 < y < 3\}$$ 

and 

$$B = \{(x, y) : 2 < x < 4,\ 2 < y < 4\}$$ 

2.2.17. Referring to Example 2.2.7, find $A \cap B$ and $A \cup B$ if 
the two equations were replaced by inequalities: $x^2 + 2x < 8$ and $x^2 + x < 6$. 

2.2.18. Find $A \cap B \cap C$ if $A = \{x : 0 < x < 4\}$, $B = \{x : 2 < x < 6\}$, 
and $C = \{x : x = 0, 1, 2, \ldots\}$. 
2.2.19. An electronic system has four components 
divided into two pairs. The two components of each pair 
are wired in parallel; the two pairs are wired in series. Let 
$A_{ij}$ denote the event “ith component in jth pair fails,” 
i = 1, 2; j = 1, 2. Let A be the event “System fails.” Write 
A in terms of the $A_{ij}$’s. 

[Figure: the two pairs of components, labeled j = 1 and j = 2] 


2.2.20. Define $A = \{x : 0 < x < 1\}$, $B = \{x : 0 < x < 3\}$, and 
$C = \{x : -1 < x < 2\}$. Draw diagrams showing each of the 
following sets of points: 

(a) $A^C \cap B \cap C$ 
(b) $A^C \cup (B \cap C)$ 
(c) $A \cap B \cap C^C$ 
(d) $[(A \cup B) \cap C^C]^C$ 


2.2.21. Let A be the set of five-card hands dealt from a 
52-card poker deck, where the denominations of the five 
cards are all consecutive—for example, (7 of hearts, 8 of 
spades, 9 of spades, 10 of hearts, jack of diamonds). Let B 
be the set of five-card hands where the suits of the five 
cards are all the same. How many outcomes are in the 
event $A \cap B$? 


2.2.22. Suppose that each of the twelve letters in the 
word 

T E S S E L L A T I O N 

is written on a chip. Define the events F, R, and V as 
follows: 

F: letters in first half of alphabet 
R: letters that are repeated 
V: letters that are vowels 

Which chips make up the following events? 

(a) $F \cap R \cap V$ 
(b) $F^C \cap R \cap V^C$ 
(c) $F \cap R^C \cap V$ 


2.2.23. Let A, B, and C be any three events defined on a 
sample space S. Show that 


(a) the outcomes in $A \cup (B \cap C)$ are the same as the 
outcomes in $(A \cup B) \cap (A \cup C)$. 

(b) the outcomes in $A \cap (B \cup C)$ are the same as the 
outcomes in $(A \cap B) \cup (A \cap C)$. 


2.2.24. Let $A_1, A_2, \ldots, A_k$ be any set of events defined on 
a sample space S. What outcomes belong to the event 

$$(A_1 \cup A_2 \cup \cdots \cup A_k) \cup (A_1^C \cap A_2^C \cap \cdots \cap A_k^C)$$ 


2.2.25. Let A, B, and C be any three events defined on 
a sample space S. Show that the operations of union and 
intersection are associative by proving that 


(a) $A \cup (B \cup C) = (A \cup B) \cup C = A \cup B \cup C$ 
(b) $A \cap (B \cap C) = (A \cap B) \cap C = A \cap B \cap C$ 


2.2.26. Suppose that three events—A,B, and C—are 
defined on a sample space S. Use the union, intersec- 
tion, and complement operations to represent each of the 
following events: 


(a) none of the three events occurs 
(b) all three of the events occur 

(c) only event A occurs 

(d) exactly one event occurs 

(e) exactly two events occur 


2.2.27. What must be true of events A and B if 


(a) $A \cup B = B$ 
(b) $A \cap B = A$ 


2.2.28. Let events A and B and sample space S be defined 
as the following intervals: 


S={x:0<x <10} 
A={x:0<x <5} 
B={x:3<x <7} 


Characterize the following events: 


(a) $A^C$ 
(b) $A \cap B$ 
(c) $A \cup B$ 
(d) $A \cap B^C$ 
(e) $A^C \cup B$ 
(f) $A^C \cap B^C$ 


2.2.29. A coin is tossed four times and the resulting 
sequence of heads and/or tails is recorded. Define the 
events A, B, and C as follows: 


A: exactly two heads appear 
B: heads and tails alternate 
C: first two tosses are heads 


(a) Which events, if any, are mutually exclusive? 
(b) Which events, if any, are subsets of other sets? 


2.2.30. Pictured on the next page are two organizational 
charts describing the way upper management vets new 
proposals. For both models, three vice presidents (1, 2, 
and 3) each voice an opinion. 

[Figure: organizational charts (a) and (b), each routing a proposal through vice presidents 1, 2, and 3] 

For (a), all three must concur if the proposal is to pass; 
if any one of the three favors the proposal in (b), it 
passes. Let $A_i$ denote the event that vice president i 
favors the proposal, i = 1, 2, 3, and let A denote the 
event that the proposal passes. Express A in terms of 
the $A_i$’s for the two office protocols. Under what sorts 
of situations might one system be preferable to the 
other? 




Expressing Events Graphically: Venn Diagrams 


Relationships based on two or more events can sometimes be difficult to express 
using only equations or verbal descriptions. An alternative approach that can be 
highly effective is to represent the underlying events graphically in a format known 
as a Venn diagram. Figure 2.2.4 shows Venn diagrams for an intersection, a union, 
a complement, and two events that are mutually exclusive. In each case, the shaded 
interior of a region corresponds to the desired event. 


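Figure 2.2.4 (Venn diagrams, one panel each for $A \cap B$, $A \cup B$, $A^C$, and two mutually exclusive events with $A \cap B = \emptyset$) 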


Example 2.2.12 

When two events A and B are defined on a sample space, we will frequently need 
to consider 


a. the event that exactly one (of the two) occurs. 
b. the event that at most one (of the two) occurs. 


Getting expressions for each of these is easy if we visualize the corresponding Venn 
diagrams. 

The shaded area in Figure 2.2.5 represents the event E that either A or B, but 
not both, occurs (that is, exactly one occurs). 


Figure 2.2.5 
Just by looking at the diagram we can formulate an expression for E. The portion 
of A, for example, included in E is $A \cap B^C$. Similarly, the portion of B included 
in E is $B \cap A^C$. It follows that E can be written as a union: 

$$E = (A \cap B^C) \cup (B \cap A^C)$$ 

(Convince yourself that an equivalent expression for E is $(A \cap B)^C \cap (A \cup B)$.) 

Figure 2.2.6 shows the event F that at most one (of the two events) occurs. Since 
the latter includes every outcome except those belonging to both A and B, we can 
write 

$$F = (A \cap B)^C$$ 
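The two expressions for "exactly one occurs" can be compared mechanically on a small sample space, alongside the expression for "at most one occurs" as the complement of the intersection. This is a sketch only: the four-outcome space S and the helper names are assumptions chosen for illustration, and checking every pair of events in one small space is a sanity test, not a proof.

```python
from itertools import combinations

S = frozenset(range(4))  # a small illustrative sample space (an assumption)

def subsets(universe):
    """Yield every subset of universe as a frozenset."""
    items = list(universe)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def exactly_one(A, B):
    return (A - B) | (B - A)      # (A n B^C) u (B n A^C)

def at_most_one(A, B):
    return S - (A & B)            # (A n B)^C

all_events = list(subsets(S))
identity_holds = all(
    exactly_one(A, B) == (S - (A & B)) & (A | B) == (A ^ B)
    for A in all_events for B in all_events
)
print(identity_holds)
```

Python's `^` operator on sets is the symmetric difference, which is the same event as "exactly one of A and B occurs."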


Questions 


2.2.31. During orientation week, the latest Spiderman 
movie was shown twice at State University. Among the 
entering class of 6000 freshmen, 850 went to see it the first 
time, 690 the second time, while 4700 failed to see it either 
time. How many saw it twice? 


2.2.32. Let A and B be any two events. Use Venn dia- 
grams to show that 
(a) the complement of their intersection is the union of 
their complements: 
$$(A \cap B)^C = A^C \cup B^C$$ 


(b) the complement of their union is the intersection of 
their complements: 


$$(A \cup B)^C = A^C \cap B^C$$ 


(These two results are known as DeMorgan’s laws.) 
2.2.33. Let A, B, and C be any three events. Use Venn 
diagrams to show that 


(a) $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$ 
(b) $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$ 


2.2.34. Let A, B, and C be any three events. Use Venn 
diagrams to show that 


(a) $A \cup (B \cup C) = (A \cup B) \cup C$ 
(b) $A \cap (B \cap C) = (A \cap B) \cap C$ 


Figure 2.2.6 (Venn diagram with the region $F = (A \cap B)^C$ shaded) 


2.2.35. Let A and B be any two events defined on a sam- 
ple space S. Which of the following sets are necessarily 
subsets of which other sets? 


A    B    $A \cup B$    $A \cap B$    $A^C \cap B$    $A \cap B^C$    $(A^C \cup B^C)^C$ 

2.2.36. Use Venn diagrams to suggest an equivalent way 
of representing the following events: 


(a) $(A \cap B^C)^C$ 
(b) $B \cup (A \cup B)^C$ 
(c) $A \cap (A \cap B)^C$ 


2.2.37. A total of twelve hundred graduates of State 
Tech have gotten into medical school in the past sev- 
eral years. Of that number, one thousand earned scores 
of twenty-seven or higher on the MCAT and four hun- 
dred had GPAs that were 3.5 or higher. Moreover, three 
hundred had MCATs that were twenty-seven or higher 
and GPAs that were 3.5 or higher. What proportion of 
those twelve hundred graduates got into medical school 
with an MCAT lower than twenty-seven and a GPA 
below 3.5? 


2.2.38. Let A, B, and C be any three events defined 
on a sample space S. Let N(A), N(B), N(C), $N(A \cap B)$, 
$N(A \cap C)$, $N(B \cap C)$, and $N(A \cap B \cap C)$ denote the 
numbers of outcomes in all the different intersections in 
which A, B, and C are involved. Use a Venn diagram to 
suggest a formula for $N(A \cup B \cup C)$. [Hint: Start with the 
sum N(A) + N(B) + N(C) and use the Venn diagram to 
identify the “adjustments” that need to be made to that 
sum before it can equal $N(A \cup B \cup C)$.] As a precedent, 
note that $N(A \cup B) = N(A) + N(B) - N(A \cap B)$. There, 
in the case of two events, subtracting $N(A \cap B)$ is the 
“adjustment.” 


2.2.39. A poll conducted by a potential presidential 
candidate asked two questions: (1) Do you support the 
candidate’s position on taxes? and (2) Do you support 
the candidate’s position on homeland security? A total of 
twelve hundred responses were received; six hundred said 
“yes” to the first question and four hundred said “yes” to 
the second. If three hundred respondents said “no” to the 
taxes question and “yes” to the homeland security ques- 
tion, how many said “yes” to the taxes question but “no” 
to the homeland security question? 


2.2.40. For two events A and B defined on a sample space 
S, $N(A \cap B^C) = 15$, $N(A^C \cap B) = 50$, and $N(A \cap B) = 2$. 
Given that N(S) = 120, how many outcomes belong to 
neither A nor B? 


2.3 The Probability Function 


Having introduced in Section 2.2 the twin concepts of “experiment” and “sample 
space,” we are now ready to pursue in a formal way the all-important problem 
of assigning a probability to an experiment’s outcome—and, more generally, to an 
event. Specifically, if A is any event defined on a sample space S, the symbol P(A) 
will denote the probability of A, and we will refer to P as the probability function. 
It is, in effect, a mapping from a set (i.e., an event) to a number. The backdrop for 
our discussion will be the unions, intersections, and complements of set theory; the 
starting point will be the axioms referred to in Section 2.1 that were originally set 
forth by Kolmogorov. 


If S has a finite number of members, Kolmogorov showed that as few as three 
axioms are necessary and sufficient for characterizing the probability function P: 


Axiom 1. Let A be any event defined over S. Then $P(A) \geq 0$. 


Axiom 2. P(S)=1. 


Axiom 3. Let A and B be any two mutually exclusive events defined over S. Then 


$$P(A \cup B) = P(A) + P(B)$$ 


When S has an infinite number of members, a fourth axiom is needed: 

Axiom 4. Let $A_1, A_2, \ldots$ be events defined over S. If $A_i \cap A_j = \emptyset$ for each $i \neq j$, then 

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$ 

From these simple statements come the general rules for manipulating the probability 
function that apply no matter what specific mathematical form the function may 
take in a particular context. 
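For a finite sample space, Axioms 1 through 3 can be checked exhaustively for any proposed assignment of probabilities. The following is a sketch under stated assumptions: P is taken to be the sum of per-outcome weights, and the names `make_P` and `satisfies_axioms` are hypothetical. Axiom 4 concerns infinite collections of events and is outside what a finite enumeration can test.

```python
from fractions import Fraction
from itertools import combinations

def make_P(weights):
    """Build P(A) = sum of outcome weights over A, for a finite sample space."""
    return lambda A: sum((weights[s] for s in A), Fraction(0))

def satisfies_axioms(weights):
    """Check Axioms 1-3 for the set function induced by weights (finite S only)."""
    S = frozenset(weights)
    P = make_P(weights)
    events = [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]
    axiom1 = all(P(A) >= 0 for A in events)
    axiom2 = P(S) == 1
    axiom3 = all(P(A | B) == P(A) + P(B)
                 for A in events for B in events if not (A & B))
    return axiom1 and axiom2 and axiom3

fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
print(satisfies_axioms(fair_die))
```

Because P is built here by adding weights, Axiom 3 holds automatically; the check can only fail through a negative weight (Axiom 1) or weights that do not total 1 (Axiom 2).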


Some Basic Properties of P 


Some of the immediate consequences of Kolmogorov’s axioms are the results 
given in Theorems 2.3.1 through 2.3.6. Despite their simplicity, several of these 
properties, as we will soon see, prove to be immensely useful in solving all sorts 
of problems. 




Theorem 2.3.1 

$P(A^C) = 1 - P(A)$. 

Proof By Axiom 2 and Definition 2.2.3, 

$$P(S) = 1 = P(A \cup A^C)$$ 

But A and $A^C$ are mutually exclusive, so 

$$P(A \cup A^C) = P(A) + P(A^C)$$ 

and the result follows. 


Theorem 2.3.2 

$P(\emptyset) = 0$. 

Proof Since $\emptyset = S^C$, $P(\emptyset) = P(S^C) = 1 - P(S) = 0$. 


Theorem 2.3.3 

If $A \subset B$, then $P(A) \leq P(B)$. 

Proof Note that the event B may be written in the form 

$$B = A \cup (B \cap A^C)$$ 

where A and $(B \cap A^C)$ are mutually exclusive. Therefore, 

$$P(B) = P(A) + P(B \cap A^C)$$ 

which implies that $P(B) \geq P(A)$ since $P(B \cap A^C) \geq 0$. 


Theorem 2.3.4 

For any event A, $P(A) \leq 1$. 

Proof The proof follows immediately from Theorem 2.3.3 because $A \subset S$ and 
P(S) = 1. 


Theorem 2.3.5 

Let $A_1, A_2, \ldots, A_n$ be events defined over S. If $A_i \cap A_j = \emptyset$ for $i \neq j$, then 

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$$ 

Proof The proof is a straightforward induction argument with Axiom 3 being the 
starting point. 


Theorem 2.3.6 

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$. 

Proof The Venn diagram for $A \cup B$ certainly suggests that the statement of the 
theorem is true (recall Figure 2.2.4). More formally, we have from Axiom 3 that 

$$P(A) = P(A \cap B^C) + P(A \cap B)$$ 

and 

$$P(B) = P(B \cap A^C) + P(A \cap B)$$ 

Adding these two equations gives 

$$P(A) + P(B) = [P(A \cap B^C) + P(B \cap A^C) + P(A \cap B)] + P(A \cap B)$$ 

By Theorem 2.3.5, the sum in the brackets is $P(A \cup B)$. If we subtract $P(A \cap B)$ from 
both sides of the equation, the result follows. 
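The addition rule for a union and the complement rule can both be checked numerically on an equally likely sample space. The sketch below uses the two-dice space; the particular events A and B are illustrative choices, not taken from the text.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))   # two fair dice, 36 equally likely outcomes

def P(event):
    """Classical probability m/n on the equally likely space S."""
    return Fraction(len(event), len(S))

A = {s for s in S if s[0] + s[1] == 7}    # sum is 7
B = {s for s in S if s[0] == 1}           # first die shows 1

lhs = P(A | B)
rhs = P(A) + P(B) - P(A & B)
print(lhs, rhs)
print(P(S - A) == 1 - P(A))               # complement rule on the same space
```

Subtracting P(A ∩ B) corrects for the outcome (1, 6), which would otherwise be counted once in P(A) and again in P(B).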
Example 2.3.1 

Let A and B be two events defined on a sample space S such that P(A) = 0.3, P(B) = 
0.5, and $P(A \cup B) = 0.7$. Find (a) $P(A \cap B)$, (b) $P(A^C \cup B^C)$, and (c) $P(A^C \cap B)$. 

a. Transposing the terms in Theorem 2.3.6 yields a general formula for the 
probability of an intersection: 

$$P(A \cap B) = P(A) + P(B) - P(A \cup B)$$ 

Here 

$$P(A \cap B) = 0.3 + 0.5 - 0.7 = 0.1$$ 

b. The two cross-hatched regions in Figure 2.3.1 correspond to $A^C$ and $B^C$. The 
union of $A^C$ and $B^C$ consists of those regions that have cross-hatching in either 
or both directions. By inspection, the only portion of S not included in $A^C \cup B^C$ 
is the intersection, $A \cap B$. By Theorem 2.3.1, then, 

$$P(A^C \cup B^C) = 1 - P(A \cap B) = 1 - 0.1 = 0.9$$ 

Figure 2.3.1 (Venn diagram of A and B in S, with $A^C$ and $B^C$ cross-hatched) 

Figure 2.3.2 (Venn diagram of A and B in S, with cross-hatching in two directions) 

c. The event $A^C \cap B$ corresponds to the region in Figure 2.3.2 where the cross-hatching 
extends in both directions, that is, everywhere in B except the 
intersection with A. Therefore, 

$$P(A^C \cap B) = P(B) - P(A \cap B) = 0.5 - 0.1 = 0.4$$ 
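One way to make the three answers concrete is to build a small sample space that realizes the given probabilities and then count. The ten-outcome space and the particular sets A and B below are assumptions chosen only so that P(A) = 0.3, P(B) = 0.5, and the union has probability 0.7; any other realization would give the same three values.

```python
from fractions import Fraction

S = set(range(10))            # ten equally likely outcomes (illustrative assumption)
A = {0, 1, 2}                 # P(A) = 0.3
B = {2, 3, 4, 5, 6}           # P(B) = 0.5, and the union of A and B has 7 outcomes

def P(event):
    return Fraction(len(event), len(S))

print(P(A & B))               # part (a)
print(P((S - A) | (S - B)))   # part (b)
print(P((S - A) & B))         # part (c)
```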
Example 2.3.2 

Show that 

$$P(A \cap B) \geq 1 - P(A^C) - P(B^C)$$ 

for any two events A and B defined on a sample space S. 

From Example 2.3.1a and Theorem 2.3.1, 

$$P(A \cap B) = P(A) + P(B) - P(A \cup B)$$ 
$$= 1 - P(A^C) + 1 - P(B^C) - P(A \cup B)$$ 

But $P(A \cup B) \leq 1$ from Theorem 2.3.4, so 

$$P(A \cap B) \geq 1 - P(A^C) - P(B^C)$$ 


Example 2.3.3 

Two cards are drawn from a poker deck without replacement. What is the probability 
that the second is higher in rank than the first? 

Let $A_1$, $A_2$, and $A_3$ be the events “First card is lower in rank,” “First card is 
higher in rank,” and “Both cards have same rank,” respectively. Clearly, the three 
$A_i$’s are mutually exclusive and they account for all possible outcomes, so from 
Theorem 2.3.5, 

$$P(A_1 \cup A_2 \cup A_3) = P(A_1) + P(A_2) + P(A_3) = P(S) = 1$$ 

Once the first card is drawn, there are three choices for the second that would have 
the same rank, that is, $P(A_3) = 3/51$. Moreover, symmetry demands that $P(A_1) = P(A_2)$, so 

$$2P(A_2) + \frac{3}{51} = 1$$ 

implying that $P(A_2) = 8/17$. By the same symmetry, $P(A_1) = 8/17$ is the probability 
that the second card is higher in rank than the first. 
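The symmetry argument can be confirmed by enumerating every ordered pair of distinct cards. This is a brute-force sketch; coding the ranks as the integers 2 through 14 with ace high is an assumption, though any strict ordering of the thirteen ranks gives the same counts.

```python
from fractions import Fraction
from itertools import permutations

# One card = (rank, suit); ranks 2..14 with ace high (the ordering is an assumption).
deck = [(rank, suit) for rank in range(2, 15) for suit in "DHCS"]

# Ordered draws without replacement: 52 * 51 equally likely pairs.
draws = list(permutations(deck, 2))
second_higher = sum(1 for first, second in draws if second[0] > first[0])
same_rank = sum(1 for first, second in draws if second[0] == first[0])

print(Fraction(second_higher, len(draws)))
print(Fraction(same_rank, len(draws)))
```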


In a newly released martial arts film, the actress playing the lead role has a stunt 
double who handles all of the physically dangerous action scenes. According to the 
script, the actress appears in 40% of the film’s scenes, her double appears in 30%, 
and the two of them are together 5% of the time. What is the probability that in a 
given scene, (a) only the stunt double appears and (b) neither the lead actress nor 
the double appears? 


a. If $L$ is the event “Lead actress appears in scene” and $D$ is the event “Double
appears in scene,” we are given that $P(L) = 0.40$, $P(D) = 0.30$, and
$P(L \cap D) = 0.05$. It follows that

\[
\begin{aligned}
P(\text{Only double appears}) &= P(D) - P(L \cap D) \\
                              &= 0.30 - 0.05 \\
                              &= 0.25
\end{aligned}
\]

(recall Example 2.3.1c).


b. The event “Neither appears” is the complement of the event “At least one
appears.” But $P(\text{At least one appears}) = P(L \cup D)$. From Theorems 2.3.1 and
2.3.6, then,

\[
\begin{aligned}
P(\text{Neither appears}) &= 1 - P(L \cup D) \\
                          &= 1 - [P(L) + P(D) - P(L \cap D)] \\
                          &= 1 - [0.40 + 0.30 - 0.05] \\
                          &= 0.35
\end{aligned}
\]

Example 2.3.5

Having endured (and survived) the mental trauma that comes from taking two years 
of chemistry, a year of physics, and a year of biology, Biff decides to test the med- 
ical school waters and sends his MCATs to two colleges, X and Y. Based on how 
his friends have fared, he estimates that his probability of being accepted at X is 
0.7, and at Y is 0.4. He also suspects there is a 75% chance that at least one of 
his applications will be rejected. What is the probability that he gets at least one 
acceptance? 

Let $A$ be the event “School X accepts him” and $B$ the event “School Y accepts
him.” We are given that $P(A) = 0.7$, $P(B) = 0.4$, and $P(A^C \cup B^C) = 0.75$. The
question is asking for $P(A \cup B)$.

From Theorem 2.3.6,

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]

Recall from Question 2.2.32 that $A^C \cup B^C = (A \cap B)^C$, so

\[ P(A \cap B) = 1 - P[(A \cap B)^C] = 1 - 0.75 = 0.25 \]

It follows that Biff’s prospects are not all that bleak; he has an 85% chance of
getting in somewhere:

\[ P(A \cup B) = 0.7 + 0.4 - 0.25 = 0.85 \]


Comment Notice that $P(A \cup B)$ varies directly with $P(A^C \cup B^C)$:

\[
\begin{aligned}
P(A \cup B) &= P(A) + P(B) - [1 - P(A^C \cup B^C)] \\
            &= P(A) + P(B) - 1 + P(A^C \cup B^C)
\end{aligned}
\]


If P(A) and P(B), then, are fixed, we get the curious result that Biff’s chances of get- 
ting at least one acceptance increase if his chances of at least one rejection increase. 
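The relationship in the Comment can be tabulated directly. The sketch below is an illustration added here, not part of the original text; the function name `p_at_least_one_acceptance` and the consistency check are assumptions introduced for the example. The check reflects the fact that $P(A \cap B)$ can be no larger than $P(A)$ or $P(B)$ and no smaller than $P(A) + P(B) - 1$, so only some values of $P(A^C \cup B^C)$ are possible once $P(A)$ and $P(B)$ are fixed.

```python
from fractions import Fraction

def p_at_least_one_acceptance(p_a, p_b, p_some_rejection):
    """P(A u B) given P(A), P(B), and P(A^C u B^C)."""
    p_both = 1 - p_some_rejection              # P(A n B) = 1 - P[(A n B)^C]
    # P(A n B) must lie between max(0, P(A) + P(B) - 1) and min(P(A), P(B)).
    if not (max(0, p_a + p_b - 1) <= p_both <= min(p_a, p_b)):
        raise ValueError("inputs are not consistent with any probability model")
    return p_a + p_b - p_both                  # Theorem 2.3.6

p_a, p_b = Fraction(7, 10), Fraction(4, 10)
for r in (Fraction(60, 100), Fraction(75, 100), Fraction(90, 100)):
    print(float(r), float(p_at_least_one_acceptance(p_a, p_b, r)))
```

For Biff's own figure of 0.75 the function returns 17/20, that is, 0.85; raising the chance of at least one rejection from 0.60 to 0.90 raises the result from 0.7 to 1.0.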


Questions 


2.3.1. According to a family-oriented lobbying group, 
there is too much crude language and violence on tele- 
vision. Forty-two percent of the programs they screened 
had language they found offensive, 27% were too violent, 
and 10% were considered excessive in both language and 
violence. What percentage of programs did comply with 
the group’s standards? 


2.3.2. Let $A$ and $B$ be any two events defined on $S$.
Suppose that $P(A) = 0.4$, $P(B) = 0.5$, and $P(A \cap B) = 0.1$.
What is the probability that $A$ or $B$ but not both
occur?


2.3.3. Express the following probabilities in terms of
$P(A)$, $P(B)$, and $P(A \cap B)$.
(a) $P(A^C \cup B^C)$
(b) $P(A^C \cap (A \cup B))$


2.3.4. Let A and B be two events defined on S. If the 
probability that at least one of them occurs is 0.3 and the 
probability that A occurs but B does not occur is 0.1, what 
is P(B)? 


2.3.5. Suppose that three fair dice are tossed. Let $A_i$ be
the event that a 6 shows on the $i$th die, $i = 1, 2, 3$. Does
$P(A_1 \cup A_2 \cup A_3) = \frac{1}{2}$? Explain.


2.3.6. Events $A$ and $B$ are defined on a sample space $S$
such that $P((A \cup B)^C) = 0.5$ and $P(A \cap B) = 0.2$. What is
the probability that either $A$ or $B$ but not both will occur?


2.3.7. Let $A_1, A_2, \ldots, A_n$ be a series of events for which
$A_i \cap A_j = \emptyset$ if $i \neq j$ and $A_1 \cup A_2 \cup \cdots \cup A_n = S$. Let $B$
be any event defined on $S$. Express $B$ as a union of
intersections.


2.3.8. Draw the Venn diagrams that would correspond to
the equations (a) $P(A \cap B) = P(B)$ and (b) $P(A \cup B) = P(B)$.


2.3.9. In the game of “odd man out” each player tosses 
a fair coin. If all the coins turn up the same except for 
one, the player tossing the different coin is declared the 
odd man out and is eliminated from the contest. Suppose 
that three people are playing. What is the probability that 
someone will be eliminated on the first toss? (Hint: Use 
Theorem 2.3.1.) 


2.3.10. An urn contains twenty-four chips, numbered 1 
through 24. One is drawn at random. Let A be the event 
that the number is divisible by 2 and let B be the event 
that the number is divisible by 3. Find $P(A \cup B)$.


2.3.11. If State’s football team has a 10% chance of win- 
ning Saturday’s game, a 30% chance of winning two weeks 
from now, and a 65% chance of losing both games, what 
are their chances of winning exactly once? 


2.3.12. Events $A_1$ and $A_2$ are such that $A_1 \cup A_2 = S$ and
$A_1 \cap A_2 = \emptyset$. Find $p_2$ if $P(A_1) = p_1$, $P(A_2) = p_2$, and
$3p_1 - p_2 = \frac{1}{2}$.

2.3.13. Consolidated Industries has come under considerable
pressure to eliminate its seemingly discriminatory
hiring practices. Company officials have agreed that during
the next five years, 60% of their new employees will be
females and 30% will be minorities. One out of four new
employees, though, will be a white male. What percentage
of their new hires will be minority females?


2.3.14. Three events, $A$, $B$, and $C$, are defined on a
sample space, $S$. Given that $P(A) = 0.2$, $P(B) = 0.1$, and
$P(C) = 0.3$, what is the smallest possible value for
$P[(A \cup B \cup C)^C]$?


2.3.15. A coin is to be tossed four times. Define events X 
and Y such that 


X: first and last coins have opposite faces 
Y: exactly two heads appear 


Assume that each of the sixteen head/tail sequences has 
the same probability. Evaluate 


(a) $P(X^C \cap Y)$
(b) $P(X \cap Y^C)$


2.3.16. Two dice are tossed. Assume that each possible
outcome has a $\frac{1}{36}$ probability. Let $A$ be the event that the
sum of the faces showing is 6, and let $B$ be the event that
the face showing on one die is twice the face showing on
the other. Calculate $P(A \cap B^C)$.


2.3.17. Let A, B, and C be three events defined on a sam- 
ple space, S. Arrange the probabilities of the following 
events from smallest to largest: 


(a) $A \cup B$
(b) $A \cap B$
(c) $A$
(d) $S$
(e) $(A \cap B) \cup (A \cap C)$


2.3.18. Lucy is currently running two dot-com scams out 
of a bogus chatroom. She estimates that the chances of 
the first one leading to her arrest are one in ten; the “risk” 
associated with the second is more on the order of one 
in thirty. She considers the likelihood that she gets busted 
for both to be 0.0025. What are Lucy’s chances of avoiding 
incarceration? 


2.4 Conditional Probability 


In Section 2.3, we calculated probabilities of certain events by manipulating other 
probabilities whose values we were given. Knowing P(A), P(B), and P(AN B), for 
example, allows us to calculate P(A U B) (recall Theorem 2.3.6). For many real- 
world situations, though, the “given” in a probability problem goes beyond simply 
knowing a set of other probabilities. Sometimes, we know for a fact that certain 
events have already occurred, and those occurrences may have a bearing on the
probability we are trying to find. In short, the probability of an event $A$ may have to
be “adjusted” if we know for certain that some related event $B$ has already occurred.
Any probability that is revised to take into account the (known) occurrence of other
events is said to be a conditional probability.

Consider a fair die being tossed, with $A$ defined as the event “6 appears.” Clearly,
$P(A) = \frac{1}{6}$. But suppose that the die has already been tossed by someone who
refuses to tell us whether or not $A$ occurred but does enlighten us to the extent
of confirming that $B$ occurred, where $B$ is the event “Even number appears.” What
are the chances of $A$ now? Here, common sense can help us: There are three equally
likely even numbers making up the event $B$, one of which satisfies the event $A$, so
the “updated” probability is $\frac{1}{3}$.

Notice that the effect of additional information, such as the knowledge that
$B$ has occurred, is to revise (indeed, to shrink) the original sample space $S$ to a
new set of outcomes $S'$. In this example, the original $S$ contained six outcomes, the
conditional sample space, three (see Figure 2.4.1).

Figure 2.4.1: $P(6, \text{relative to } S) = 1/6$; $P(6, \text{relative to } S') = 1/3$

The symbol $P(A|B)$, read “the probability of $A$ given $B$,” is used to denote
a conditional probability. Specifically, $P(A|B)$ refers to the probability that $A$ will
occur given that $B$ has already occurred.

It will be convenient to have a formula for $P(A|B)$ that can be evaluated in
terms of the original $S$, rather than the revised $S'$. Suppose that $S$ is a finite sample
space with $n$ outcomes, all equally likely. Assume that $A$ and $B$ are two events
containing $a$ and $b$ outcomes, respectively, and let $c$ denote the number of outcomes in
the intersection of $A$ and $B$ (see Figure 2.4.2). Based on the argument suggested in
Figure 2.4.1, the conditional probability of $A$ given $B$ is the ratio of $c$ to $b$. But $c/b$
can be written as the quotient of two other ratios,

\[ \frac{c}{b} = \frac{c/n}{b/n} \]

Figure 2.4.2

so, for this particular case,

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \tag{2.4.1} \]


The same underlying reasoning that leads to Equation 2.4.1, though, holds true even 
when the outcomes are not equally likely or when S is uncountably infinite. 
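For a finite sample space with equally likely outcomes, Equation 2.4.1 reduces to counting. The sketch below is an added illustration rather than anything from the original text; the helper name `conditional_probability` is hypothetical, and it assumes that events are represented as Python sets of outcomes. It reproduces the die calculation, where $A$ is “6 appears” and $B$ is “Even number appears.”

```python
from fractions import Fraction

def conditional_probability(a, b, sample_space):
    """P(A|B) for equally likely outcomes, computed as P(A n B) / P(B)."""
    n = len(sample_space)
    p_b = Fraction(len(b & sample_space), n)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    p_a_and_b = Fraction(len(a & b & sample_space), n)
    return p_a_and_b / p_b

S = set(range(1, 7))      # faces of a fair die
A = {6}                   # "6 appears"
B = {2, 4, 6}             # "Even number appears"
print(conditional_probability(A, B, S))   # 1/3
```

Dividing both counts by $n$ before taking the ratio mirrors the $(c/n)/(b/n)$ step in the text, although for equally likely outcomes the factor of $n$ cancels.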
Definition 2.4.1. Let $A$ and $B$ be any two events defined on $S$ such that
$P(B) > 0$. The conditional probability of $A$, assuming that $B$ has already
occurred, is written $P(A|B)$ and is given by

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]

Comment Definition 2.4.1 can be cross-multiplied to give a frequently useful
expression for the probability of an intersection. If $P(A|B) = P(A \cap B)/P(B)$,
then

\[ P(A \cap B) = P(A|B)P(B) \tag{2.4.2} \]

Example 2.4.1

A card is drawn from a poker deck. What is the probability that the card is a club,
given that the card is a king?

Intuitively, the answer is $\frac{1}{4}$: The king is equally likely to be a heart, diamond,
club, or spade. More formally, let $C$ be the event “Card is a club”; let $K$ be the event
“Card is a king.” By Definition 2.4.1,

\[ P(C|K) = \frac{P(C \cap K)}{P(K)} \]

But $P(K) = \frac{4}{52}$ and $P(C \cap K) = P(\text{Card is a king of clubs}) = \frac{1}{52}$.
Therefore, confirming our intuition,

\[ P(C|K) = \frac{1/52}{4/52} = \frac{1}{4} \]

[Notice in this example that the conditional probability $P(C|K)$ is numerically
the same as the unconditional probability $P(C)$; they both equal $\frac{1}{4}$. This means
that our knowledge that $K$ has occurred gives us no additional insight about the
chances of $C$ occurring. Two events having this property are said to be independent.
We will examine the notion of independence and its consequences in detail in
Section 2.5.]

Example 2.4.2

Our intuitions can often be fooled by probability problems, even ones that appear to 
be simple and straightforward. The “two boys” problem described here is an often- 
cited case in point. 

Consider the set of families having two children. Assume that the four possible 
birth sequences—(younger child is a boy, older child is a boy), (younger child is a 
boy, older child is a girl), and so on—are equally likely. What is the probability that 
both children are boys given that at least one is a boy? 

The answer is not $\frac{1}{2}$. The correct answer can be deduced from Definition 2.4.1.
By assumption, each of the four possible birth sequences, $(b, b)$, $(b, g)$, $(g, b)$, and
$(g, g)$, has a $\frac{1}{4}$ probability of occurring. Let $A$ be the event that both children are
boys, and let $B$ be the event that at least one child is a boy. Then

\[ P(A|B) = P(A \cap B)/P(B) = P(A)/P(B) \]

since $A$ is a subset of $B$ (so the overlap between $A$ and $B$ is just $A$). But $A$
has one outcome $\{(b, b)\}$ and $B$ has three outcomes $\{(b, g), (g, b), (b, b)\}$. Applying
Definition 2.4.1, then, gives

\[ P(A|B) = \frac{1/4}{3/4} = \frac{1}{3} \]

Another correct approach is to go back to the sample space and deduce the
value of $P(A|B)$ from first principles. Figure 2.4.3 shows events $A$ and $B$ defined
on the four family types that comprise the sample space $S$. Knowing that $B$ has
occurred redefines the sample space to include three outcomes, each now having a
$\frac{1}{3}$ probability. Of those three possible outcomes, one, namely $(b, b)$, satisfies the
event $A$. It follows that $P(A|B) = \frac{1}{3}$.

Figure 2.4.3: $S$ = sample space of two-child families
[outcomes written as (first born, second born)]
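The first-principles argument for the “two boys” problem can be carried out by listing the sample space. What follows is a brief sketch added for illustration, not part of the original text; the variable names are arbitrary, and the only assumption is the one the example already makes, namely that the four birth sequences are equally likely.

```python
from fractions import Fraction
from itertools import product

# Four equally likely birth sequences, written (first born, second born).
S = set(product("bg", repeat=2))

A = {s for s in S if s == ("b", "b")}    # both children are boys
B = {s for s in S if "b" in s}           # at least one child is a boy

p_b = Fraction(len(B), len(S))           # 3/4
p_a_and_b = Fraction(len(A & B), len(S)) # 1/4, since A is a subset of B
print(p_a_and_b / p_b)                   # 1/3
```

Enumerating `B` explicitly makes it plain that three outcomes, not two, remain once we learn that at least one child is a boy.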


Example 2.4.3

Two events $A$ and $B$ are defined such that (1) the probability that $A$ occurs but $B$
does not occur is 0.2, (2) the probability that $B$ occurs but $A$ does not occur is 0.1,
and (3) the probability that neither occurs is 0.6. What is $P(A|B)$?

The three events whose probabilities are given are indicated on the Venn
diagram shown in Figure 2.4.4. Since

\[ P(\text{Neither occurs}) = 0.6 = P((A \cup B)^C) \]

it follows that

\[ P(A \cup B) = 1 - 0.6 = 0.4 = P(A \cap B^C) + P(A \cap B) + P(B \cap A^C) \]

so

\[ P(A \cap B) = 0.4 - 0.2 - 0.1 = 0.1 \]

Figure 2.4.4: Venn diagram marking the regions $A \cap B^C$, $B \cap A^C$, and “Neither A nor B”
From Definition 2.4.1, then,

\[
\begin{aligned}
P(A|B) &= \frac{P(A \cap B)}{P(B)} = \frac{P(A \cap B)}{P(A \cap B) + P(B \cap A^C)} \\
       &= \frac{0.1}{0.1 + 0.1} \\
       &= 0.5
\end{aligned}
\]

Example 2.4.4

The possibility of importing liquified natural gas (LNG) from Algeria has been
suggested as one way of coping with a future energy crunch. Complicating matters,
though, is the fact that LNG is highly volatile and poses an enormous safety hazard.
Any major spill occurring near a U.S. port could result in a fire of catastrophic
proportions. The question, therefore, of the likelihood of a spill becomes critical input
for future policymakers who may have to decide whether or not to implement the
proposal.

Two numbers need to be taken into account: (1) the probability that a tanker will
have an accident near a port, and (2) the probability that a major spill will develop
given that an accident has happened. Although no significant spills of LNG have
yet occurred anywhere in the world, these probabilities can be approximated from
records kept on similar tankers transporting less dangerous cargo. On the basis of
such data, it has been estimated (42) that the probability is 8/50,000 that an LNG
tanker will have an accident on any one trip. Given that an accident has occurred, it
is suspected that only three times in fifteen thousand will the damage be sufficiently
severe that a major spill would develop. What are the chances that a given LNG
shipment would precipitate a catastrophic disaster?

Let $A$ denote the event “Spill develops” and let $B$ denote the event “Accident
occurs.” Past experience is suggesting that $P(B) = 8/50,000$ and $P(A|B) = 3/15,000$.
Of primary concern is the probability that an accident will occur and a spill will
ensue, that is, $P(A \cap B)$. Using Equation 2.4.2, we find that the chances of a
catastrophic accident are on the order of three in one hundred million:

\[
\begin{aligned}
P(\text{Accident occurs and spill develops}) &= P(A \cap B) \\
  &= P(A|B)P(B) \\
  &= \frac{3}{15,000} \cdot \frac{8}{50,000} \\
  &= 0.000000032
\end{aligned}
\]

Example 2.4.5

Max and Muffy are two myopic deer hunters who shoot simultaneously at a nearby 
sheepdog that they have mistaken for a 10-point buck. Based on years of well- 
documented ineptitude, it can be assumed that Max has a 20% chance of hitting 
a stationary target at close range, Muffy has a 30% chance, and the probability is 
0.06 that they will both be on target. Suppose that the sheepdog is hit and killed by 
exactly one bullet. What is the probability that Muffy fired the fatal shot? 

Let $A$ be the event that Max hit the dog, and let $B$ be the event that Muffy hit
the dog. Then $P(A) = 0.2$, $P(B) = 0.3$, and $P(A \cap B) = 0.06$. We are trying to find

\[ P(B \mid (A^C \cap B) \cup (A \cap B^C)) \]

where the event $(A^C \cap B) \cup (A \cap B^C)$ is the union of $A$ and $B$ minus the
intersection; that is, it represents the event that either $A$ or $B$ but not both occur
(recall Figure 2.4.4).


Notice, also, from Figure 2.4.4 that the intersection of $B$ and $(A^C \cap B) \cup (A \cap B^C)$
is the event $A^C \cap B$. Therefore, from Definition 2.4.1,

\[
\begin{aligned}
P(B \mid (A^C \cap B) \cup (A \cap B^C)) &= [P(A^C \cap B)]/[P\{(A^C \cap B) \cup (A \cap B^C)\}] \\
  &= [P(B) - P(A \cap B)]/[P(A \cup B) - P(A \cap B)] \\
  &= [0.3 - 0.06]/[0.2 + 0.3 - 0.06 - 0.06] \\
  &= 0.63
\end{aligned}
\]

Example 2.4.6

The highways connecting two resort areas at A and B are shown in Figure 2.4.5. 
There is a direct route through the mountains and a more circuitous route going 
through a third resort area at C in the foothills. Travel between A and B during the 
winter months is not always possible, the roads sometimes being closed due to snow 
and ice. Suppose we let $E_1$, $E_2$, and $E_3$ denote the events that highways AB, AC,
and BC are passable, respectively, and we know from past years that on a typical 
winter day, 


Figure 2.4.5

\[ P(E_1) = \frac{2}{5}, \quad P(E_2) = \frac{3}{4}, \quad P(E_3) = \frac{2}{3} \]

and

\[ P(E_3|E_2) = \frac{4}{5}, \quad P(E_1|E_2 \cap E_3) = \frac{1}{2} \]


What is the probability that a traveler will be able to get from A to B? 
If $E$ denotes the event that we can get from A to B, then

\[ E = E_1 \cup (E_2 \cap E_3) \]

It follows that

\[ P(E) = P(E_1) + P(E_2 \cap E_3) - P[E_1 \cap (E_2 \cap E_3)] \]

Applying Equation 2.4.2 three times gives

\[
\begin{aligned}
P(E) &= P(E_1) + P(E_3|E_2)P(E_2) - P[E_1|(E_2 \cap E_3)]P(E_2 \cap E_3) \\
     &= P(E_1) + P(E_3|E_2)P(E_2) - P[E_1|(E_2 \cap E_3)]P(E_3|E_2)P(E_2)
\end{aligned}
\]


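\[ P(E) = \frac{2}{5} + \left(\frac{4}{5}\right)\left(\frac{3}{4}\right) - \left(\frac{1}{2}\right)\left(\frac{4}{5}\right)\left(\frac{3}{4}\right) = 0.7 \]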


(Which route should a traveler starting from A try first to maximize the chances of
getting to B?)
Case Study 2.4.1


Several years ago, a television program (inadvertently) spawned a conditional 
probability problem that led to more than a few heated discussions, even in the 
national media. The show was Let’s Make a Deal, and the question involved 
the strategy that contestants should take to maximize their chances of winning 
prizes. 

On the program, a contestant would be presented with three doors, behind 
one of which was the prize. After the contestant had selected a door, the host, 
Monty Hall, would open one of the other two doors, showing that the prize 
was not there. Then he would give the contestant a choice—either stay with 
the door initially selected or switch to the “third” door, which had not been 
opened. 

For many viewers, common sense seemed to suggest that switching doors 
would make no difference. By assumption, the prize had a one-third chance 
of being behind each of the doors when the game began. Once a door was 
opened, it was argued that each of the remaining doors now had a one-half 
probability of hiding the prize, so contestants gained nothing by switching their 
bets. 

Not so. An application of Definition 2.4.1 shows that it did make a 
difference —contestants, in fact, doubled their chances of winning by switching 
doors. To see why, consider a specific (but typical) case: the contestant has bet 
on Door #2 and Monty Hall has opened Door #3. Given that sequence of events, 
we need to calculate and compare the conditional probability of the prize being 
behind Door #1 and Door #2, respectively. If the former is larger (and we will 
prove that it is), the contestant should switch doors. 

Table 2.4.1 shows the sample space associated with the scenario just 
described. If the prize is actually behind Door #1, the host has no choice but 
to open Door #3; similarly, if the prize is behind Door #3, the host has no choice 
but to open Door #1. In the event that the prize is behind Door #2, though, 
the host would (theoretically) open Door #1 half the time and Door #3 half the 
time. 


Table 2.4.1 


(Prize Location, Door Opened) Probability 


(1, 3) 1/3 
(2, 1) 1/6 
(2, 3) 1/6 
(3, 1) 1/3 


Notice that the four outcomes in S are not equally likely. There is neces- 
sarily a one-third probability that the prize is behind each of the three doors. 
However, the two choices that the host has when the prize is behind Door #2 
necessitate that the two outcomes (2, 1) and (2, 3) share the one-third probabil- 
ity that represents the chances of the prize being behind Door #2. Each, then, 
has the one-sixth probability listed in Table 2.4.1. 


Let $A$ be the event that the prize is behind Door #2, and let $B$ be the event
that the host opened Door #3. Then

\[
\begin{aligned}
P(A|B) = P(\text{Contestant wins by not switching}) &= [P(A \cap B)]/P(B) \\
  &= \frac{1/6}{1/3 + 1/6} \\
  &= \frac{1}{3}
\end{aligned}
\]

Now, let $A^*$ be the event that the prize is behind Door #1, and let $B$ (as before)
be the event that the host opens Door #3. In this case,

\[
\begin{aligned}
P(A^*|B) = P(\text{Contestant wins by switching}) &= [P(A^* \cap B)]/P(B) \\
  &= \frac{1/3}{1/3 + 1/6} \\
  &= \frac{2}{3}
\end{aligned}
\]

Common sense would have led us astray again! If given the choice, contestants
should have always switched doors. Doing so upped their chances of winning
from one-third to two-thirds.
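The two conditional probabilities in Case Study 2.4.1 can be recomputed directly from Table 2.4.1. The sketch below is an added illustration, not part of the original text; the dictionary `table` and the helper `prob` are hypothetical names, and the probabilities are simply those listed in the table for a contestant who has bet on Door #2.

```python
from fractions import Fraction

# Table 2.4.1: (prize location, door opened) -> probability,
# for a contestant who has bet on Door #2.
table = {
    (1, 3): Fraction(1, 3),
    (2, 1): Fraction(1, 6),
    (2, 3): Fraction(1, 6),
    (3, 1): Fraction(1, 3),
}

def prob(event):
    """Total probability of the outcomes for which event(prize, opened) is true."""
    return sum((p for outcome, p in table.items() if event(*outcome)), Fraction(0))

p_b = prob(lambda prize, opened: opened == 3)                     # host opens Door #3
p_stay = prob(lambda prize, opened: prize == 2 and opened == 3) / p_b
p_switch = prob(lambda prize, opened: prize == 1 and opened == 3) / p_b
print(p_stay, p_switch)   # 1/3 2/3
```

Because the four outcomes are not equally likely, the calculation sums probabilities rather than counting outcomes.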


Questions 


2.4.1. Suppose that two fair dice are tossed. What is the 
probability that the sum equals 10 given that it exceeds 8? 


2.4.2. Find $P(A \cap B)$ if $P(A) = 0.2$, $P(B) = 0.4$, and
$P(A|B) + P(B|A) = 0.75$.


2.4.3. If P(A|B) < P(A), show that P(B|A) < P(B). 


2.4.4. Let $A$ and $B$ be two events such that $P((A \cup B)^C) = 0.6$
and $P(A \cap B) = 0.1$. Let $E$ be the event that either $A$ or
$B$ but not both will occur. Find $P(E|A \cup B)$.


2.4.5. Suppose that in Example 2.4.2 we ignored the ages 
of the children and distinguished only three family types: 
(boy, boy), (girl, boy), and (girl, girl). Would the condi- 
tional probability of both children being boys given that 
at least one is a boy be different from the answer found on 
p. 35? Explain. 


2.4.6. Two events, $A$ and $B$, are defined on a sample space
$S$ such that $P(A|B) = 0.6$, $P(\text{At least one of the events occurs}) = 0.8$,
and $P(\text{Exactly one of the events occurs}) = 0.6$.
Find $P(A)$ and $P(B)$.


2.4.7. An urn contains one red chip and one white chip. 
One chip is drawn at random. If the chip selected is red, 
that chip together with two additional red chips are put 
back into the urn. If a white chip is drawn, the chip is 
returned to the urn. Then a second chip is drawn. What 
is the probability that both selections are red? 


2.4.8. Given that $P(A) = a$ and $P(B) = b$, show that

\[ P(A|B) \geq \frac{a + b - 1}{b} \]


2.4.9. An urn contains one white chip and a second chip
that is equally likely to be white or black. A chip is drawn
at random and returned to the urn. Then a second chip is
drawn. What is the probability that a white appears on the
second draw given that a white appeared on the first draw?
(Hint: Let $W_i$ be the event that a white chip is selected
on the $i$th draw, $i = 1, 2$. Then $P(W_2|W_1) = \frac{P(W_1 \cap W_2)}{P(W_1)}$. If
both chips in the urn are white, $P(W_1) = 1$; otherwise,
$P(W_1) = \frac{1}{2}$.)


2.4.10. Suppose events $A$ and $B$ are such that $P(A \cap B) = 0.1$
and $P((A \cup B)^C) = 0.3$. If $P(A) = 0.2$, what does
$P[(A \cap B)|(A \cup B)^C]$ equal? (Hint: Draw the Venn diagram.)


2.4.11. One hundred voters were asked their opinions 
of two candidates, A and B, running for mayor. Their 
responses to three questions are summarized below: 


Number Saying “Yes” 


Do you like A? 65 
Do you like B? 55 
Do you like both? 25 


(a) What is the probability that someone likes neither? 

(b) What is the probability that someone likes exactly 
one? 

(c) What is the probability that someone likes at least 
one? 

(d) What is the probability that someone likes at most 
one? 
(e) What is the probability that someone likes exactly
one given that he or she likes at least one? 

(f) Of those who like at least one, what proportion like 
both? 

(g) Of those who do not like A, what proportion like B? 


2.4.12. A fair coin is tossed three times. What is the prob- 
ability that at least two heads will occur given that at most 
two heads have occurred? 


2.4.13. Two fair dice are rolled. What is the probability 
that the number on the first die was at least as large as 4 
given that the sum of the two dice was 8? 


2.4.14. Four cards are dealt from a standard 52-card 
poker deck. What is the probability that all four are aces 
given that at least three are aces? (Note: There are 270,725 
different sets of four cards that can be dealt. Assume 
that the probability associated with each of those hands 
is 1/270,725.) 


2.4.15. Given that $P(A \cap B^C) = 0.3$, $P((A \cup B)^C) = 0.2$, and
$P(A \cap B) = 0.1$, find $P(A|B)$.


2.4.16. Given that P(A) + P(B)=0.9, P(A|B) =0.5, and 
P(B|A) =0.4, find P(A). 


2.4.17. Let $A$ and $B$ be two events defined on a sample
space $S$ such that $P(A \cap B^C) = 0.1$, $P(A^C \cap B) = 0.3$, and
$P((A \cup B)^C) = 0.2$. Find the probability that at least one of
the two events occurs given that at most one occurs.


2.4.18. Suppose two dice are rolled. Assume that each 
possible outcome has probability 1/36. Let A be the event 
that the sum of the two dice is greater than or equal to 8, 
and let B be the event that at least one of the dice shows a 
5. Find P(A|B). 


2.4.19. According to your neighborhood bookie, five 
horses are scheduled to run in the third race at the local 
track, and handicappers have assigned them the following 
probabilities of winning: 


Horse Probability of Winning 
Scorpion 0.10 
Starry Avenger 0.25 
Australian Doll 0.15 
Dusty Stake 0.30 
Outandout 0.20 


Suppose that Australian Doll and Dusty Stake are 
scratched from the race at the last minute. What are the 
chances that Outandout will prevail over the reduced 
field? 


2.4.20. Andy, Bob, and Charley have all been serving 
time for grand theft auto. According to prison scuttle- 
butt, the warden plans to release two of the three next 
week. They all have identical records, so the two to be 
released will be chosen at random, meaning that each has 
a two-thirds probability of being included in the two to 
be set free. Andy, however, is friends with a guard who 
will know ahead of time which two will leave. He offers 
to tell Andy the name of one prisoner other than him- 
self who will be released. Andy, however, declines the 
offer, believing that if he learns the name of one pris- 
oner scheduled to be released, then his chances of being 
the other person set free will drop to one-half (since only 
two prisoners will be left at that point). Is his concern 
justified? 


Applying Conditional Probability to Higher-Order Intersections 


We have seen that conditional probabilities can be useful in evaluating intersection
probabilities, that is, $P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$. A similar result
holds for higher-order intersections. Consider $P(A \cap B \cap C)$. By thinking of $A \cap B$ as
a single event (say, $D$) we can write

\[
\begin{aligned}
P(A \cap B \cap C) &= P(D \cap C) \\
                   &= P(C|D)P(D) \\
                   &= P(C|A \cap B)P(A \cap B) \\
                   &= P(C|A \cap B)P(B|A)P(A)
\end{aligned}
\]

Repeating this same argument for $n$ events, $A_1, A_2, \ldots, A_n$, gives a formula for the
general case:

\[
P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_n|A_1 \cap A_2 \cap \cdots \cap A_{n-1})
  \cdot P(A_{n-1}|A_1 \cap A_2 \cap \cdots \cap A_{n-2}) \cdots P(A_2|A_1) \cdot P(A_1)
\tag{2.4.3}
\]
Example 2.4.7

An urn contains five white chips, four black chips, and three red chips. Four chips are
drawn sequentially and without replacement. What is the probability of obtaining
the sequence (white, red, white, black)?

Figure 2.4.6 shows the evolution of the urn’s composition as the desired
sequence is assembled. Define the following four events:

A: white chip is drawn on first selection
B: red chip is drawn on second selection
C: white chip is drawn on third selection
D: black chip is drawn on fourth selection

Figure 2.4.6: starting from 5W, 4B, 3R, the urn holds 4W, 4B, 3R after the first draw;
4W, 4B, 2R after the second; 3W, 4B, 2R after the third; and 3W, 3B, 2R after the fourth

Our objective is to find $P(A \cap B \cap C \cap D)$.
From Equation 2.4.3,

\[ P(A \cap B \cap C \cap D) = P(D|A \cap B \cap C) \cdot P(C|A \cap B) \cdot P(B|A) \cdot P(A) \]

Each of the probabilities on the right-hand side of the equation here can be gotten
by just looking at the urns pictured in Figure 2.4.6: $P(D|A \cap B \cap C) = \frac{4}{9}$,
$P(C|A \cap B) = \frac{4}{10}$, $P(B|A) = \frac{3}{11}$, and $P(A) = \frac{5}{12}$. Therefore, the
probability of drawing a (white, red, white, black) sequence is 0.02:

\[ P(A \cap B \cap C \cap D) = \frac{4}{9} \cdot \frac{4}{10} \cdot \frac{3}{11} \cdot \frac{5}{12} = \frac{240}{11,880} = 0.02 \]
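Equation 2.4.3 lends itself to a short loop: multiply the conditional probability of each draw, then update the urn. The following is a minimal sketch added for illustration and not part of the original text; the function name `sequence_probability` and the dictionary representation of the urn are assumptions made for the example.

```python
from fractions import Fraction

def sequence_probability(urn, sequence):
    """Probability of drawing `sequence` in order, without replacement.

    Multiplies the conditional probabilities in Equation 2.4.3, updating
    the urn's composition after each draw.
    """
    counts = dict(urn)                  # copy so the caller's urn is untouched
    prob = Fraction(1)
    for color in sequence:
        total = sum(counts.values())
        available = counts.get(color, 0)
        if total == 0 or available == 0:
            return Fraction(0)          # the requested chip cannot be drawn
        prob *= Fraction(available, total)
        counts[color] = available - 1
    return prob

urn = {"white": 5, "black": 4, "red": 3}
p = sequence_probability(urn, ["white", "red", "white", "black"])
print(p, round(float(p), 4))            # 2/99 0.0202
```

The exact value 2/99 is the fraction 240/11,880 in lowest terms, which rounds to the 0.02 quoted in Example 2.4.7.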


Case Study 2.4.2 


Since the late 1940s, tens of thousands of eyewitness accounts of strange lights 
in the skies, unidentified flying objects, and even alleged abductions by little 
green men have made headlines. None of these incidents, though, has produced 
any hard evidence, any irrefutable proof that Earth has been visited by a race 
of extraterrestrials. Still, the haunting question remains—are we alone in the 
universe? Or are there other civilizations, more advanced than ours, making 
the occasional flyby? 

Until, or unless, a flying saucer plops down on the White House lawn and a 
strange-looking creature emerges with the proverbial “Take me to your leader” 
demand, we may never know whether we have any cosmic neighbors. Equa- 
tion 2.4.3, though, can help us speculate on the probability of our not being 
alone. 



Recent discoveries suggest that planetary systems much like our own may 
be quite common. If so, there are likely to be many planets whose chemical 
makeups, temperatures, pressures, and so on, are suitable for life. Let those 
planets be the points in our sample space. Relative to them, we can define three 
events: 


A: life arises 
B: technical civilization arises (one capable of interstellar communication) 
C: technical civilization is flourishing now 


In terms of $A$, $B$, and $C$, the probability that a habitable planet is presently
supporting a technical civilization is the probability of an intersection, specifically,
$P(A \cap B \cap C)$. Associating a number with $P(A \cap B \cap C)$ is highly problematic,
but the task is simplified considerably if we work instead with the equivalent
conditional formula, $P(C|B \cap A) \cdot P(B|A) \cdot P(A)$.

Scientists speculate (153) that life of some kind may arise on one-third of
all planets having a suitable environment and that life on maybe 1% of all those
planets will evolve into a technical civilization. In our notation, $P(A) = \frac{1}{3}$ and
$P(B|A) = \frac{1}{100}$.

More difficult to estimate is $P(C|A \cap B)$. On Earth, we have had the
capability of interstellar communication (that is, radio astronomy) for only a few
decades, so $P(C|A \cap B)$, empirically, is on the order of $1 \times 10^{-8}$. But that may
be an overly pessimistic estimate of a technical civilization’s ability to endure. It
may be true that if a civilization can avoid annihilating itself when it first
develops nuclear weapons, its prospects for longevity are fairly good. If that were the
case, $P(C|A \cap B)$ might be as large as $1 \times 10^{-2}$.

Putting these estimates into the computing formula for $P(A \cap B \cap C)$ yields
a range for the probability of a habitable planet currently supporting a technical
civilization. The chances may be as small as $3.3 \times 10^{-11}$ or as “large” as
$3.3 \times 10^{-5}$:

\[ (1 \times 10^{-8})\left(\frac{1}{100}\right)\left(\frac{1}{3}\right) < P(A \cap B \cap C) < (1 \times 10^{-2})\left(\frac{1}{100}\right)\left(\frac{1}{3}\right) \]

or

\[ 0.000000000033 < P(A \cap B \cap C) < 0.000033 \]

A better way to put these figures in some kind of perspective is to think
in terms of numbers rather than probabilities. Astronomers estimate there are
$3 \times 10^{11}$ habitable planets in our Milky Way galaxy. Multiplying that total by
the two limits for $P(A \cap B \cap C)$ gives an indication of how many cosmic
neighbors we are likely to have. Specifically, $3 \times 10^{11} \cdot 0.000000000033 \approx 10$, while
$3 \times 10^{11} \cdot 0.000033 \approx 10,000,000$. So, on the one hand, we may be a galactic
rarity. At the same time, the probabilities do not preclude the very real possibility
that the Milky Way is abuzz with activity and that our neighbors number in the
millions.
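The arithmetic behind these bounds is easy to reproduce. The sketch below is an added illustration, not part of the original text; the constant and function names are hypothetical, and the inputs are only the estimates quoted in the case study ($P(A) = 1/3$, $P(B|A) = 1/100$, and $3 \times 10^{11}$ habitable planets).

```python
from fractions import Fraction

P_LIFE = Fraction(1, 3)               # P(A), as quoted in the case study
P_TECH_GIVEN_LIFE = Fraction(1, 100)  # P(B|A)
HABITABLE_PLANETS = 3 * 10**11        # astronomers' estimate quoted in the text

def p_flourishing_civilization(p_now_given_tech):
    """P(A n B n C) = P(C|A n B) * P(B|A) * P(A)."""
    return p_now_given_tech * P_TECH_GIVEN_LIFE * P_LIFE

# Pessimistic and optimistic values of P(C|A n B).
for p_now in (Fraction(1, 10**8), Fraction(1, 10**2)):
    p_all = p_flourishing_civilization(p_now)
    print(f"{float(p_all):.1e}", f"{float(p_all * HABITABLE_PLANETS):,.0f}")
```

With exact fractions the expected numbers of civilizations come out to 10 and 10,000,000; the 3.3 figures in the text are the same probabilities rounded to two significant digits.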


Questions 


2.4.21. An urn contains six white chips, four black chips, 
and five red chips. Five chips are drawn out, one at a 
time and without replacement. What is the probability 
of getting the sequence (black, black, red, white, white)? 
Suppose that the chips are numbered 1 through 15. What 
is the probability of getting a specific sequence —say, (2, 6, 
4,9, 13)? 


2.4.22. A man has n keys on a key ring, one of which 
opens the door to his apartment. Having celebrated a bit 
too much one evening, he returns home only to find him- 
self unable to distinguish one key from another. Resource- 
ful, he works out a fiendishly clever plan: He will choose 
a key at random and try it. If it fails to open the door, he 
will discard it and choose at random one of the remaining 
$n - 1$ keys, and so on. Clearly, the probability that he gains
entrance with the first key he selects is $1/n$. Show that the
probability the door opens with the third key he tries is
also $1/n$. (Hint: What has to happen before he even gets
to the third key?)


2.4.23. Suppose that four cards are drawn from a stan- 
dard 52-card poker deck. What is the probability of draw- 
ing, in order, a 7 of diamonds, a jack of spades, a 10 of 
diamonds, and a 5 of hearts? 


2.4.24. One chip is drawn at random from an urn that 
contains one white chip and one black chip. If the white 
chip is selected, we simply return it to the urn; if the black 
chip is drawn, that chip, together with another black chip,
is returned to the urn. Then a second chip is drawn,
with the same rules for returning it to the urn. Calculate 
the probability of drawing two whites followed by three 
blacks. 


Calculating “Unconditional” and “Inverse” Probabilities 


We conclude this section with two very useful theorems that apply to partitioned
sample spaces. By definition, a set of events $A_1, A_2, \ldots, A_n$ “partition” $S$ if every
outcome in the sample space belongs to one and only one of the $A_i$’s; that is, the
$A_i$’s are mutually exclusive and their union is $S$ (see Figure 2.4.7).


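[Figure 2.4.7: the sample space $S$ partitioned into $A_1, A_2, \ldots, A_n$, with an event $B$ drawn across the partition]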


Let $B$, as pictured, denote any event defined on $S$. The first result, Theorem 2.4.1,
gives a formula for the “unconditional” probability of $B$ (in terms of the
$A_i$’s). Then Theorem 2.4.2 calculates the set of conditional probabilities, $P(A_j \mid B)$,
$j = 1, 2, \ldots, n$.


Theorem 2.4.1

Let $\{A_i\}_{i=1}^{n}$ be a set of events defined over $S$ such that $S = \bigcup_{i=1}^{n} A_i$, $A_i \cap A_j = \emptyset$ for
$i \neq j$, and $P(A_i) > 0$ for $i = 1, 2, \ldots, n$. For any event $B$,

$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) P(A_i)$$

Proof By the conditions imposed on the $A_i$’s,

$$B = (B \cap A_1) \cup (B \cap A_2) \cup \cdots \cup (B \cap A_n)$$

and

$$P(B) = P(B \cap A_1) + P(B \cap A_2) + \cdots + P(B \cap A_n)$$

But each $P(B \cap A_i)$ can be written as the product $P(B \mid A_i) P(A_i)$, and the result
follows.
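The formula in Theorem 2.4.1 is a weighted sum, which the following sketch makes concrete. The function name `total_probability` is an illustrative choice, not something defined in the text; the sample numbers are those of the urn problem in Example 2.4.8.

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """Return P(B) = sum of P(B | A_i) * P(A_i) over a partition A_1, ..., A_n.

    priors[i] is P(A_i); likelihoods[i] is P(B | A_i).
    """
    if len(priors) != len(likelihoods):
        raise ValueError("need one P(B | A_i) for each P(A_i)")
    if abs(sum(priors) - 1) > 1e-12:
        raise ValueError("the P(A_i) must sum to 1 for a partition")
    return sum(l * p for l, p in zip(likelihoods, priors))

# Urn problem: A_1 = red chip transferred, A_2 = white chip transferred.
p_red = total_probability(
    priors=[Fraction(2, 6), Fraction(4, 6)],
    likelihoods=[Fraction(4, 5), Fraction(3, 5)],
)
print(p_red)
```

Passing `Fraction` values keeps the answer exact; ordinary floats work as well, at the cost of rounding.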


Example 2.4.8

Urn I contains two red chips and four white chips; urn II, three red and one white. A
chip is drawn at random from urn I and transferred to urn II. Then a chip is drawn
from urn II. What is the probability that the chip drawn from urn II is red?

Let $B$ be the event “Chip drawn from urn II is red”; let $A_1$ and $A_2$ be the events
“Chip transferred from urn I is red” and “Chip transferred from urn I is white,”
respectively. By inspection (see Figure 2.4.8), we can deduce all the probabilities
appearing in the right-hand side of the formula in Theorem 2.4.1:

[Figure 2.4.8: one chip is transferred from urn I to urn II, then one chip is drawn from urn II]

$$P(B \mid A_1) = \frac{4}{5} \qquad P(B \mid A_2) = \frac{3}{5}$$

$$P(A_1) = \frac{2}{6} \qquad P(A_2) = \frac{4}{6}$$

Putting all this information together, we see that the chances are two out of three
that a red chip will be drawn from urn II:

$$P(B) = P(B \mid A_1) P(A_1) + P(B \mid A_2) P(A_2) = \frac{4}{5} \cdot \frac{2}{6} + \frac{3}{5} \cdot \frac{4}{6} = \frac{2}{3}$$


Example 2.4.9

A standard poker deck is shuffled and the card on top is removed. What is the
probability that the second card is an ace?
Define the following events:

$B$: second card is an ace
$A_1$: top card was an ace
$A_2$: top card was not an ace

Then $P(B \mid A_1) = \frac{3}{51}$, $P(B \mid A_2) = \frac{4}{51}$, $P(A_1) = \frac{4}{52}$, and $P(A_2) = \frac{48}{52}$. Since the $A_i$’s
partition the sample space of two-card selections, Theorem 2.4.1 applies. Substituting
into the expression for $P(B)$ shows that $\frac{4}{52}$ is the probability that the second card is
an ace:

$$P(B) = P(B \mid A_1) P(A_1) + P(B \mid A_2) P(A_2) = \frac{3}{51} \cdot \frac{4}{52} + \frac{4}{51} \cdot \frac{48}{52} = \frac{4}{52}$$


Comment Notice that $P(B) = P(\text{2nd card is an ace})$ is numerically the same as
$P(A_1) = P(\text{first card is an ace})$. The analysis in Example 2.4.9 illustrates a basic
principle in probability that says, in effect, “What you don’t know, doesn’t matter.” Here,
removal of the top card is irrelevant to any subsequent probability calculations if the
identity of that card remains unknown.
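The principle in the Comment can be confirmed by brute force. The sketch below lists every equally likely ordered (top card, second card) pair and counts those whose second card is an ace; labeling the cards 0 through 51, with the first four standing for the aces, is an arbitrary convention adopted for the illustration.

```python
from fractions import Fraction
from itertools import permutations

# Cards 0-3 play the role of the four aces; cards 4-51 are everything else.
ACES = {0, 1, 2, 3}

# Every ordered (top card, second card) pair is equally likely: 52 * 51 of them.
pairs = list(permutations(range(52), 2))

# Count the pairs whose second card is an ace, whatever the top card was.
second_is_ace = sum(1 for _top, second in pairs if second in ACES)

p_second_ace = Fraction(second_is_ace, len(pairs))
print(p_second_ace)
```

The count is 4 × 51 out of 52 × 51 ordered pairs, so the ratio reduces to 4/52, in agreement with Example 2.4.9.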


Example 2.4.10

Ashley is hoping to land a summer internship with a public relations firm. If her
interview goes well, she has a 70% chance of getting an offer. If the interview is
a bust, though, her chances of getting the position drop to 20%. Unfortunately,
Ashley tends to babble incoherently when she is under stress, so the likelihood of
the interview going well is only 0.10. What is the probability that Ashley gets the
internship?

Let $B$ be the event “Ashley is offered internship,” let $A_1$ be the event “Interview
goes well,” and let $A_2$ be the event “Interview does not go well.” By assumption,

$$P(B \mid A_1) = 0.70 \qquad P(B \mid A_2) = 0.20$$

$$P(A_1) = 0.10 \qquad P(A_2) = 1 - P(A_1) = 1 - 0.10 = 0.90$$

According to Theorem 2.4.1, Ashley has a 25% chance of landing the internship:

$$P(B) = P(B \mid A_1) P(A_1) + P(B \mid A_2) P(A_2) = (0.70)(0.10) + (0.20)(0.90) = 0.25$$


Example 2.4.11

In an upstate congressional race, the incumbent Republican ($R$) is running against
a field of three Democrats ($D_1$, $D_2$, and $D_3$) seeking the nomination. Political pundits
estimate that the probabilities of $D_1$, $D_2$, or $D_3$ winning the primary are 0.35,
0.40, and 0.25, respectively. Furthermore, results from a variety of polls are suggesting
that $R$ would have a 40% chance of defeating $D_1$ in the general election, a
35% chance of defeating $D_2$, and a 60% chance of defeating $D_3$. Assuming all these
estimates to be accurate, what are the chances that the Republican will retain his
seat?

Let $B$ denote the event that “$R$ wins general election,” and let $A_i$ denote the
event “$D_i$ wins Democratic primary,” $i = 1, 2, 3$. Then

$$P(A_1) = 0.35 \qquad P(A_2) = 0.40 \qquad P(A_3) = 0.25$$

and

$$P(B \mid A_1) = 0.40 \qquad P(B \mid A_2) = 0.35 \qquad P(B \mid A_3) = 0.60$$

so

$$P(B) = P(\text{Republican wins general election})$$
$$= P(B \mid A_1) P(A_1) + P(B \mid A_2) P(A_2) + P(B \mid A_3) P(A_3)$$
$$= (0.40)(0.35) + (0.35)(0.40) + (0.60)(0.25)$$
$$= 0.43$$
Example 2.4.12

Three chips are placed in an urn. One is red on both sides, a second is blue on both
sides, and the third is red on one side and blue on the other. One chip is selected at
random and placed on a table. Suppose that the color showing on that chip is red.
What is the probability that the color underneath is also red (see Figure 2.4.9)?

[Figure 2.4.9]

At first glance, it may seem that the answer is one-half: We know that the
blue/blue chip has not been drawn, and only one of the remaining two (the red/red
chip) satisfies the event that the color underneath is red. If this game were played
over and over, though, and records were kept of the outcomes, it would be found
that the proportion of times that a red top has a red bottom is two-thirds, not
the one-half that our intuition might suggest. The correct answer follows from an
application of Theorem 2.4.1.

Define the following events:

$A$: bottom side of chip drawn is red
$B$: top side of chip drawn is red
$A_1$: red/red chip is drawn
$A_2$: blue/blue chip is drawn
$A_3$: red/blue chip is drawn

From the definition of conditional probability,

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

But $P(A \cap B) = P(\text{Both sides are red}) = P(\text{red/red chip}) = \frac{1}{3}$. Theorem 2.4.1 can be
used to find the denominator, $P(B)$:

$$P(B) = P(B \mid A_1) P(A_1) + P(B \mid A_2) P(A_2) + P(B \mid A_3) P(A_3) = 1 \cdot \frac{1}{3} + 0 \cdot \frac{1}{3} + \frac{1}{2} \cdot \frac{1}{3} = \frac{1}{2}$$

Therefore,

$$P(A \mid B) = \frac{1/3}{1/2} = \frac{2}{3}$$

Comment The question posed in Example 2.4.12 gives rise to a simple but effective
con game. The trick is to convince a “mark” that the initial analysis given above is
correct, meaning that the bottom has a fifty-fifty chance of being the same color as
the top. Under that incorrect presumption that the game is “fair,” both participants
put up the same amount of money, but the gambler (knowing the correct analysis)
always bets that the bottom is the same color as the top. In the long run, then, the
con artist will be winning an even-money bet two-thirds of the time!
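The two-thirds figure can also be obtained by listing outcomes instead of applying Theorem 2.4.1. The sketch below assumes that each chip is equally likely to be drawn and that either of its sides is equally likely to land face up, which gives six equally likely (top, bottom) outcomes; the variable names are illustrative.

```python
from fractions import Fraction

# The three chips, each written as its two side colors.
chips = [("red", "red"), ("blue", "blue"), ("red", "blue")]

# Each chip can land with either side up, giving six equally likely
# (top color, bottom color) outcomes.
outcomes = []
for side_1, side_2 in chips:
    outcomes.append((side_1, side_2))
    outcomes.append((side_2, side_1))

# Condition on the top being red, then ask how often the bottom is red too.
top_red = [o for o in outcomes if o[0] == "red"]
both_red = [o for o in top_red if o[1] == "red"]

p_bottom_red_given_top_red = Fraction(len(both_red), len(top_red))
print(p_bottom_red_given_top_red)
```

Three of the six outcomes show red on top, and two of those three belong to the red/red chip, which is where the intuitive answer of one-half goes wrong.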


Questions 


2.4.25. A toy manufacturer buys ball bearings from three 
different suppliers—50% of her total order comes from 
supplier 1, 30% from supplier 2, and the rest from sup- 
plier 3. Past experience has shown that the quality-control 
standards of the three suppliers are not all the same. Two 
percent of the ball bearings produced by supplier 1 are 
defective, while suppliers 2 and 3 produce defective bear- 
ings 3% and 4% of the time, respectively. What proportion 
of the ball bearings in the toy manufacturer’s inventory 
are defective? 


2.4.26. A fair coin is tossed. If a head turns up, a fair die 
is tossed; if a tail turns up, two fair dice are tossed. What 
is the probability that the face (or the sum of the faces) 
showing on the die (or the dice) is equal to 6? 


2.4.27. Foreign policy experts estimate that the probabil- 
ity is 0.65 that war will break out next year between two 
Middle East countries if either side significantly escalates 
its terrorist activities. Otherwise, the likelihood of war is 
estimated to be 0.05. Based on what has happened this 
year, the chances of terrorism reaching a critical level in 
the next twelve months are thought to be three in ten. 
What is the probability that the two countries will go 
to war? 


2.4.28. A telephone solicitor is responsible for canvassing 
three suburbs. In the past, 60% of the completed calls to 
Belle Meade have resulted in contributions, compared to 
55% for Oak Hill and 35% for Antioch. Her list of tele- 
phone numbers includes one thousand households from 
Belle Meade, one thousand from Oak Hill, and two thou- 
sand from Antioch. Suppose that she picks a number at 
random from the list and places the call. What is the 
probability that she gets a donation? 


2.4.29. If men constitute 47% of the population and tell 
the truth 78% of the time, while women tell the truth 63% 
of the time, what is the probability that a person selected 
at random will answer a question truthfully? 


2.4.30. Urn I contains three red chips and one white 
chip. Urn II contains two red chips and two white chips.
One chip is drawn from each urn and transferred to the 
other urn. Then a chip is drawn from the first urn. What is 
the probability that the chip ultimately drawn from urn I 
is red? 


2.4.31. Medical records show that 0.01% of the general
adult population not belonging to a high-risk group (for
example, intravenous drug users) are HIV-positive. Blood
tests for the virus are 99.9% accurate when given to some- 
one infected and 99.99% accurate when given to someone 
not infected. What is the probability that a random adult 
not in a high-risk group will test positive for the HIV 
virus? 


2.4.32. Recall the “survival” lottery described in Ques- 
tion 2.2.14. What is the probability of release associated 
with the prisoner’s optimal strategy? 


2.4.33. State College is playing Backwater A&M for the 
conference football championship. If Backwater’s first- 
string quarterback is healthy, A&M has a 75% chance of 
winning. If they have to start their backup quarterback, 
their chances of winning drop to 40%. The team physician 
says that there is a 70% chance that the first-string quar- 
terback will play. What is the probability that Backwater 
wins the game? 


2.4.34. An urn contains forty red chips and sixty white 
chips. Six chips are drawn out and discarded, and a seventh 
chip is drawn. What is the probability that the seventh chip 
is red? 


2.4.35. A study has shown that seven out of ten people 
will say “heads” if asked to call a coin toss. Given that 
the coin is fair, though, a head occurs, on the average, 
only five times out of ten. Does it follow that you have 
the advantage if you let the other person call the toss? 
Explain. 


2.4.36. Based on pretrial speculation, the probability that 
a jury returns a guilty verdict in a certain high-profile 
murder case is thought to be 15% if the defense can 
discredit the police department and 80% if they cannot. 
Veteran court observers believe that the skilled defense 
attorneys have a 70% chance of convincing the jury that 
the police either contaminated or planted some of the key 
evidence. What is the probability that the jury returns a 
guilty verdict? 


2.4.37. As an incoming freshman, Marcus believes that he
has a 25% chance of earning a GPA in the 3.5 to 4.0 range,
a 35% chance of graduating with a 3.0 to 3.5 GPA, and a
40% chance of finishing with a GPA less than 3.0. From
what the pre-med advisor has told him, Marcus has an 8
in 10 chance of getting into medical school if his GPA is
above 3.5, a 5 in 10 chance if his GPA is in the 3.0 to 3.5
range, and only a 1 in 10 chance if his GPA falls below
3.0. Based on those estimates, what is the probability that
Marcus gets into medical school?


2.4.38. The governor of a certain state has decided to 
come out strongly for prison reform and is preparing a 
new early release program. Its guidelines are simple: pris- 
oners related to members of the governor’s staff would 
have a 90% chance of being released early; the probability 
of early release for inmates not related to the governor’s 
staff would be 0.01. Suppose that 40% of all inmates are 
related to someone on the governor’s staff. What is the 
probability that a prisoner selected at random would be 
eligible for early release? 


2.4.39. Following are the percentages of students of State
College enrolled in each of the school’s main divisions.
Also listed are the proportions of students in each division
who are women.

Division           % of students    % Women
Humanities               40            60
Natural science          10            15
History                  30            45
Social science           20            75
Total                   100

Suppose the registrar selects one person at random. What
is the probability that the student selected will be a male?


Bayes’ Theorem

The second result in this section that is set against the backdrop of a partitioned sample
space has a curious history. The first explicit statement of Theorem 2.4.2, coming
in 1812, was due to Laplace, but it was named after the Reverend Thomas Bayes,
whose 1763 paper (published posthumously) had already outlined the result. On
one level, the theorem is a relatively minor extension of the definition of conditional
probability. When viewed from a loftier perspective, though, it takes on some rather
profound philosophical implications. The latter, in fact, have precipitated a schism
among practicing statisticians: “Bayesians” analyze data one way; “non-Bayesians”
often take a fundamentally different approach (see Section 5.8).

Our use of the result here will have nothing to do with its statistical interpretation.
We will apply it simply as the Reverend Bayes originally intended, as a
formula for evaluating a certain kind of “inverse” probability. If we know $P(B \mid A_i)$
for all $i$, the theorem enables us to compute conditional probabilities “in the other
direction”; that is, we can deduce $P(A_j \mid B)$ from the $P(B \mid A_i)$’s.

Theorem 2.4.2 (Bayes’)

Let $\{A_i\}_{i=1}^{n}$ be a set of $n$ events, each with positive probability, that partition $S$
in such a way that $\bigcup_{i=1}^{n} A_i = S$ and $A_i \cap A_j = \emptyset$ for $i \neq j$. For any event $B$ (also defined
on $S$), where $P(B) > 0$,

$$P(A_j \mid B) = \frac{P(B \mid A_j) P(A_j)}{\sum_{i=1}^{n} P(B \mid A_i) P(A_i)}$$

for any $1 \leq j \leq n$.

Proof From Definition 2.4.1,

$$P(A_j \mid B) = \frac{P(A_j \cap B)}{P(B)} = \frac{P(B \mid A_j) P(A_j)}{P(B)}$$

But Theorem 2.4.1 allows the denominator to be written as $\sum_{i=1}^{n} P(B \mid A_i) P(A_i)$, and
the result follows.
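As a computation, Bayes’ theorem amounts to forming the products $P(B \mid A_i) P(A_i)$ and dividing each by their sum. The sketch below does exactly that; the function name `posterior` is an illustrative choice, not notation from the text, and the sample numbers are those of the biased-coin problem in Example 2.4.13.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return the list of P(A_j | B) for a partition A_1, ..., A_n.

    priors[i] is P(A_i); likelihoods[i] is P(B | A_i).
    """
    if len(priors) != len(likelihoods):
        raise ValueError("need one P(B | A_i) for each P(A_i)")
    # Numerators of Bayes' formula: P(B | A_i) * P(A_i) = P(B and A_i).
    joint = [l * p for l, p in zip(likelihoods, priors)]
    # Denominator, by Theorem 2.4.1: P(B).
    p_b = sum(joint)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return [j / p_b for j in joint]

# Biased coin: A_1 = heads (urn I), A_2 = tails (urn II), B = white chip drawn.
p_heads, p_tails = posterior(
    priors=[Fraction(2, 3), Fraction(1, 3)],
    likelihoods=[Fraction(3, 7), Fraction(6, 9)],
)
print(p_heads, p_tails)
```

Because the $A_j$’s partition $S$, the returned values always sum to 1, which is a convenient check on hand calculations.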
Problem-Solving Hints


(Working with Partitioned Sample Spaces) 


Students sometimes have difficulty setting up problems that involve partitioned
sample spaces (in particular, ones whose solution requires an application of
either Theorem 2.4.1 or 2.4.2) because of the nature and amount of information
that need to be incorporated into the answers. The “trick” is learning to
identify which part of the “given” corresponds to $B$ and which parts correspond
to the $A_i$’s. The following hints may help.


1. As you read the question, pay particular attention to the last one or two 
sentences. Is the problem asking for an unconditional probability (in which 
case Theorem 2.4.1 applies) or a conditional probability (in which case 
Theorem 2.4.2 applies)? 

2. If the question is asking for an unconditional probability, let B denote the 
event whose probability you are trying to find; if the question is asking for 
a conditional probability, let B denote the event that has already happened. 

3. Once event $B$ has been identified, reread the beginning of the question and
assign the $A_i$’s.


Example 2.4.13

A biased coin, twice as likely to come up heads as tails, is tossed once. If it shows
heads, a chip is drawn from urn I, which contains three white chips and four red
chips; if it shows tails, a chip is drawn from urn II, which contains six white chips and
three red chips. Given that a white chip was drawn, what is the probability that the
coin came up tails (see Figure 2.4.10)?

[Figure 2.4.10: heads leads to urn I (3W, 4R); tails leads to urn II (6W, 3R); a white chip is drawn]

Since $P(\text{heads}) = 2P(\text{tails})$, it must be true that $P(\text{heads}) = \frac{2}{3}$ and $P(\text{tails}) = \frac{1}{3}$.
Define the events

$B$: white chip is drawn
$A_1$: coin came up heads (i.e., chip came from urn I)
$A_2$: coin came up tails (i.e., chip came from urn II)

Our objective is to find $P(A_2 \mid B)$. From Figure 2.4.10,

$$P(B \mid A_1) = \frac{3}{7} \qquad P(B \mid A_2) = \frac{6}{9}$$

$$P(A_1) = \frac{2}{3} \qquad P(A_2) = \frac{1}{3}$$

so

$$P(A_2 \mid B) = \frac{P(B \mid A_2) P(A_2)}{P(B \mid A_1) P(A_1) + P(B \mid A_2) P(A_2)} = \frac{(6/9)(1/3)}{(3/7)(2/3) + (6/9)(1/3)} = \frac{7}{16}$$


Example 2.4.14

During a power blackout, one hundred persons are arrested on suspicion of looting.
Each is given a polygraph test. From past experience it is known that the polygraph
is 90% reliable when administered to a guilty suspect and 98% reliable when
given to someone who is innocent. Suppose that of the one hundred persons taken
into custody, only twelve were actually involved in any wrongdoing. What is the
probability that a given suspect is innocent given that the polygraph says he is guilty?

Let $B$ be the event “Polygraph says suspect is guilty,” and let $A_1$ and $A_2$ be
the events “Suspect is guilty” and “Suspect is not guilty,” respectively. To say that
the polygraph is “90% reliable when administered to a guilty suspect” means that
$P(B \mid A_1) = 0.90$. Similarly, the 98% reliability for innocent suspects implies that
$P(B^C \mid A_2) = 0.98$, or, equivalently, $P(B \mid A_2) = 0.02$.

We also know that $P(A_1) = \frac{12}{100}$ and $P(A_2) = \frac{88}{100}$. Substituting into Theorem 2.4.2,
then, shows that the probability a suspect is innocent given that the polygraph says
he is guilty is 0.14:

$$P(A_2 \mid B) = \frac{P(B \mid A_2) P(A_2)}{P(B \mid A_1) P(A_1) + P(B \mid A_2) P(A_2)} = \frac{(0.02)(88/100)}{(0.90)(12/100) + (0.02)(88/100)} = 0.14$$


Example 2.4.15

As medical technology advances and adults become more health conscious, the
demand for diagnostic screening tests inevitably increases. Looking for problems,
though, when no symptoms are present can have undesirable consequences that
may outweigh the intended benefits.

Suppose, for example, a woman has a medical procedure performed to see
whether she has a certain type of cancer. Let $B$ denote the event that the test says she
has cancer, and let $A_1$ denote the event that she actually does (and $A_2$, the event that
she does not). Furthermore, suppose the prevalence of the disease and the precision
of the diagnostic test are such that

$P(A_1) = 0.0001$ [and $P(A_2) = 0.9999$]
$P(B \mid A_1) = 0.90 = P(\text{Test says woman has cancer when, in fact, she does})$
$P(B \mid A_2) = P(B \mid A_1^C) = 0.001 = P(\text{false positive}) = P(\text{Test says woman has cancer when, in fact, she does not})$

What is the probability that she does have cancer, given that the diagnostic
procedure says she does? That is, calculate $P(A_1 \mid B)$.

Although the method of solution here is straightforward, the actual numerical
answer is not what we would expect. From Theorem 2.4.2,

$$P(A_1 \mid B) = \frac{P(B \mid A_1) P(A_1)}{P(B \mid A_1) P(A_1) + P(B \mid A_1^C) P(A_1^C)} = \frac{(0.9)(0.0001)}{(0.9)(0.0001) + (0.001)(0.9999)} = 0.08$$

So, only 8% of those women identified as having cancer actually do! Table 2.4.2
shows the strong dependence of $P(A_1 \mid B)$ on $P(A_1)$ and $P(B \mid A_1^C)$.

Table 2.4.2

P(A_1)     P(B | A_1^C)     P(A_1 | B)
0.0001     0.001            0.08
0.0001     0.0001           0.47
0.001      0.001            0.47
0.001      0.0001           0.90
0.01       0.001            0.90
0.01       0.0001           0.99

In light of these probabilities, the practicality of screening programs directed at
diseases having a low prevalence is open to question, especially when the diagnostic
procedure, itself, poses a nontrivial health risk. (For precisely those two reasons, the
use of chest X-rays to screen for tuberculosis is no longer advocated by the medical
community.)
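The entries of Table 2.4.2 can be regenerated with a few lines of code. The sketch below holds $P(B \mid A_1)$ at the 0.90 assumed in the example and varies the other two inputs; the function and parameter names are illustrative.

```python
def p_cancer_given_positive(prevalence, false_positive_rate, sensitivity=0.90):
    """Bayes' theorem for a two-event partition {A_1, A_1^C}.

    prevalence          = P(A_1)
    false_positive_rate = P(B | A_1^C)
    sensitivity         = P(B | A_1)
    """
    numerator = sensitivity * prevalence
    denominator = numerator + false_positive_rate * (1 - prevalence)
    return numerator / denominator

# Reproduce the six rows of the table, rounded to two decimals.
table = []
for prevalence in (0.0001, 0.001, 0.01):
    for false_positive_rate in (0.001, 0.0001):
        value = round(p_cancer_given_positive(prevalence, false_positive_rate), 2)
        table.append((prevalence, false_positive_rate, value))
        print(prevalence, false_positive_rate, value)
```

Reading down the output shows the pattern the table is meant to convey: a tenfold change in either the prevalence or the false-positive rate moves the answer far more than any realistic improvement in sensitivity could.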


Example 2.4.16

According to the manufacturer’s specifications, your home burglar alarm has a 95%
chance of going off if someone breaks into your house. During the two years you
have lived there, the alarm has gone off on five different nights, each time for no
apparent reason. Suppose the alarm goes off tomorrow night. What is the probability
that someone is trying to break into your house? (Note: Police statistics show
that the chances of any particular house in your neighborhood being burglarized on
any given night are two in ten thousand.)

Let $B$ be the event “Alarm goes off tomorrow night,” and let $A_1$ and $A_2$ be
the events “House is being burglarized” and “House is not being burglarized,”
respectively. Then

$P(B \mid A_1) = 0.95$
$P(B \mid A_2) = 5/730$ (i.e., five nights in two years)
$P(A_1) = 2/10{,}000$
$P(A_2) = 1 - P(A_1) = 9998/10{,}000$

The probability in question is $P(A_1 \mid B)$.

Intuitively, it might seem that $P(A_1 \mid B)$ should be close to 1 because the alarm’s
“performance” probabilities look good: $P(B \mid A_1)$ is close to 1 (as it should be)
and $P(B \mid A_2)$ is close to 0 (as it should be). Nevertheless, $P(A_1 \mid B)$ turns out to
be surprisingly small:

$$P(A_1 \mid B) = \frac{P(B \mid A_1) P(A_1)}{P(B \mid A_1) P(A_1) + P(B \mid A_2) P(A_2)} = \frac{(0.95)(2/10{,}000)}{(0.95)(2/10{,}000) + (5/730)(9998/10{,}000)} = 0.027$$

That is, if you hear the alarm going off, the probability is only 0.027 that your house
is being burglarized.

Computationally, the reason $P(A_1 \mid B)$ is so small is that $P(A_2)$ is so large. The
latter makes the denominator of $P(A_1 \mid B)$ large and, in effect, “washes out” the
numerator. Even if $P(B \mid A_1)$ were substantially increased (by installing a more
expensive alarm), $P(A_1 \mid B)$ would remain largely unchanged (see Table 2.4.3).

Table 2.4.3

P(B | A_1)     0.95     0.97     0.99     0.999
P(A_1 | B)     0.027    0.028    0.028    0.028
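A short calculation shows why Table 2.4.3 is so flat. The sketch below fixes the burglary rate and the false-alarm rate at the values used in the example and varies only $P(B \mid A_1)$; the function and parameter names are illustrative.

```python
def p_burglary_given_alarm(p_alarm_given_burglary,
                           p_burglary=2 / 10_000,
                           p_false_alarm=5 / 730):
    """Bayes' theorem with A_1 = burglary, A_2 = no burglary, B = alarm sounds."""
    numerator = p_alarm_given_burglary * p_burglary
    denominator = numerator + p_false_alarm * (1 - p_burglary)
    return numerator / denominator

# One entry per column of the table, rounded to three decimals.
row = [round(p_burglary_given_alarm(p), 3) for p in (0.95, 0.97, 0.99, 0.999)]
print(row)
```

The false-alarm term in the denominator is roughly 0.0068, about thirty-five times the numerator, so raising $P(B \mid A_1)$ from 0.95 to 0.999 barely changes the ratio.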


Questions 


2.4.40. Urn I contains two white chips and one red chip; 
urn II has one white chip and two red chips. One chip is 
drawn at random from urn I and transferred to urn II. 
Then one chip is drawn from urn II. Suppose that a red 
chip is selected from urn II. What is the probability that 
the chip transferred was white? 


2.4.41. Urn I contains three red chips and five white chips; 
urn II contains four reds and four whites; urn III contains 
five reds and three whites. One urn is chosen at random 
and one chip is drawn from that urn. Given that the chip 
drawn was red, what is the probability that III was the urn 
sampled? 


2.4.42. A dashboard warning light is supposed to flash red 
if a car’s oil pressure is too low. On a certain model, the 
probability of the light flashing when it should is 0.99; 2% 
of the time, though, it flashes for no apparent reason. If 
there is a 10% chance that the oil pressure really is low, 
what is the probability that a driver needs to be concerned 
if the warning light goes on? 


2.4.43. Building permits were issued last year to three 
contractors starting up a new subdivision: Tara Construc- 
tion built two houses; Westview, three houses; and Hearth- 
stone, six houses. Tara’s houses have a 60% probability of 
developing leaky basements; homes built by Westview and
Hearthstone have that same problem 50% of the time and
40% of the time, respectively. Yesterday, the Better Busi- 
ness Bureau received a complaint from one of the new 
homeowners that his basement is leaking. Who is most 
likely to have been the contractor? 


2.4.44. Two sections of a senior probability course are 
being taught. From what she has heard about the two 
instructors listed, Francesca estimates that her chances of 
passing the course are 0.85 if she gets Professor X and 0.60 
if she gets Professor Y. The section into which she is put 
is determined by the registrar. Suppose that her chances 
of being assigned to Professor X are four out of ten. Fif- 
teen weeks later we learn that Francesca did, indeed, pass 
the course. What is the probability she was enrolled in 
Professor X’s section? 


2.4.45. A liquor store owner is willing to cash per- 
sonal checks for amounts up to $50, but she has become 
wary of customers who wear sunglasses. Fifty percent 
of checks written by persons wearing sunglasses bounce. 
In contrast, 98% of the checks written by persons not 
wearing sunglasses clear the bank. She estimates that 
10% of her customers wear sunglasses. If the bank 
returns a check and marks it “insufficient funds,” what 
is the probability it was written by someone wearing 
sunglasses? 


2.4.46. Brett and Margo have each thought about mur- 
dering their rich Uncle Basil in hopes of claiming their 
inheritance a bit early. Hoping to take advantage of Basil’s 
predilection for immoderate desserts, Brett has put rat 
poison into the cherries flambé; Margo, unaware of Brett’s 
activities, has laced the chocolate mousse with cyanide. 
Given the amounts likely to be eaten, the probability 
of the rat poison being fatal is 0.60; the cyanide, 0.90. 
Based on other dinners where Basil was presented with 
the same dessert options, we can assume that he has a 50% 
chance of asking for the cherries flambé, a 40% chance of 
ordering the chocolate mousse, and a 10% chance of skip- 
ping dessert altogether. No sooner are the dishes cleared 
away than Basil drops dead. In the absence of any other 
evidence, who should be considered the prime suspect? 


2.4.47. Josh takes a twenty-question multiple-choice 
exam where each question has five possible answers. Some 
of the answers he knows, while others he gets right just by 
making lucky guesses. Suppose that the conditional prob- 
ability of his knowing the answer to a randomly selected 
question given that he got it right is 0.92. How many of the 
twenty questions was he prepared for? 


2.4.48. Recently the U.S. Senate Committee on Labor 
and Public Welfare investigated the feasibility of setting 
up a national screening program to detect child abuse. 
A team of consultants estimated the following probabil- 
ities: (1) one child in ninety is abused, (2) a screening 
program can detect an abused child 90% of the time, and 
(3) a screening program would incorrectly label 3% of all 
nonabused children as abused. What is the probability that 
a child is actually abused given that the screening program 
makes that diagnosis? How does the probability change if 
the incidence of abuse is one in one thousand? Or one in 
fifty? 


2.4.49. At State University, 30% of the students are 
majoring in humanities, 50% in history and culture, 
and 20% in science. Moreover, according to figures 
released by the registrar, the percentages of women 
majoring in humanities, history and culture, and science
are 75%, 45%, and 30%, respectively. Suppose Justin 
meets Anna at a fraternity party. What is the probability 
that Anna is a history and culture major? 


2.4.50. An “eyes-only” diplomatic message is to be trans- 
mitted as a binary code of 0’s and 1’s. Past experience with 
the equipment being used suggests that if a 0 is sent, it will 
be (correctly) received as a 0 90% of the time (and mis- 
takenly decoded as a 1 10% of the time). If a 1 is sent, it 
will be received as a 1 95% of the time (and as a 0 5% of
the time). The text being sent is thought to be 70% 1’s and 
30% 0’s. Suppose the next signal sent is received as a 1. 
What is the probability that it was sent as a 0? 


2.4.51. When Zach wants to contact his girlfriend and he 
knows she is not at home, he is twice as likely to send her 
an e-mail as he is to leave a message on her answering 
machine. The probability that she responds to his e-mail 
within three hours is 80%; her chances of being similarly 
prompt in answering a phone message increase to 90%. 
Suppose she responded within two hours to the message 
he left this morning. What is the probability that Zach was 
communicating with her via e-mail? 


2.4.52. A dot-com company ships products from three 
different warehouses (A, B, and C). Based on customer 
complaints, it appears that 3% of the shipments coming 
from A are somehow faulty, as are 5% of the shipments 
coming from B, and 2% coming from C. Suppose a cus- 
tomer is mailed an order and calls in a complaint the next 
day. What is the probability the item came from Ware- 
house C? Assume that Warehouses A, B, and C ship 30%, 
20%, and 50% of the dot-com’s sales, respectively. 


2.4.53. A desk has three drawers. The first contains two 
gold coins, the second has two silver coins, and the third 
has one gold coin and one silver coin. A coin is drawn from 
a drawer selected at random. Suppose the coin selected 
was silver. What is the probability that the other coin in 
that drawer is gold? 


Section 2.4 dealt with the problem of reevaluating the probability of a given event
in light of the additional information that some other event has already occurred. It
often is the case, though, that the probability of the given event remains unchanged,
regardless of the outcome of the second event; that is, $P(A \mid B) = P(A) = P(A \mid B^C)$.
Events sharing this property are said to be independent. Definition 2.5.1 gives a
necessary and sufficient condition for two events to be independent.

Definition 2.5.1. Two events $A$ and $B$ are said to be independent if $P(A \cap B) = P(A) \cdot P(B)$.
Comment The fact that the probability of the intersection of two independent
events is equal to the product of their individual probabilities follows immediately
from our first definition of independence, that $P(A \mid B) = P(A)$. Recall that the
definition of conditional probability holds true for any two events $A$ and $B$ [provided
that $P(B) > 0$]:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

But $P(A \mid B)$ can equal $P(A)$ only if $P(A \cap B)$ factors into $P(A)$ times $P(B)$.

Example 2.5.1

Let $A$ be the event of drawing a king from a standard poker deck and $B$, the
event of drawing a diamond. Then, by Definition 2.5.1, $A$ and $B$ are independent
because the probability of their intersection (drawing a king of diamonds) is equal
to $P(A) \cdot P(B)$:

$$P(A \cap B) = \frac{1}{52} = \frac{1}{13} \cdot \frac{1}{4} = P(A) \cdot P(B)$$

Example 2.5.2

Suppose that $A$ and $B$ are independent events. Does it follow that $A^C$ and $B^C$ are
also independent? That is, does $P(A \cap B) = P(A) \cdot P(B)$ guarantee that
$P(A^C \cap B^C) = P(A^C) \cdot P(B^C)$?

Yes. The proof is accomplished by equating two different expressions for
$P(A^C \cup B^C)$. First, by Theorem 2.3.6,

$$P(A^C \cup B^C) = P(A^C) + P(B^C) - P(A^C \cap B^C) \qquad (2.5.1)$$

But the union of two complements is the complement of their intersection (recall
Question 2.2.32). Therefore,

$$P(A^C \cup B^C) = 1 - P(A \cap B) \qquad (2.5.2)$$

Combining Equations 2.5.1 and 2.5.2, we get

$$1 - P(A \cap B) = 1 - P(A) + 1 - P(B) - P(A^C \cap B^C)$$

Since $A$ and $B$ are independent, $P(A \cap B) = P(A) \cdot P(B)$, so

$$P(A^C \cap B^C) = 1 - P(A) + 1 - P(B) - [1 - P(A) \cdot P(B)]$$
$$= [1 - P(A)][1 - P(B)]$$
$$= P(A^C) \cdot P(B^C)$$

the latter factorization implying that $A^C$ and $B^C$ are, themselves, independent. (If $A$
and $B$ are independent, are $A$ and $B^C$ independent?)
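A numerical check is no substitute for the proof, but it makes the result concrete. The sketch below reuses the king and diamond events of Example 2.5.1 and tests the product rule for every pairing of an event or its complement; the card encoding and the helper `prob` are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))  # 52 equally likely (rank, suit) cards

def prob(event):
    """Probability of an event, i.e. a subset of the deck, for one random draw."""
    return Fraction(len(event), len(deck))

kings = {card for card in deck if card[0] == "K"}            # event A
diamonds = {card for card in deck if card[1] == "diamonds"}  # event B

# Test P(X and Y) == P(X) * P(Y) for X in {A, A^C} and Y in {B, B^C}.
checks = []
for x in (kings, deck - kings):
    for y in (diamonds, deck - diamonds):
        checks.append(prob(x & y) == prob(x) * prob(y))
print(checks)
```

All four comparisons hold for this pair of events, which is consistent with the derivation above and hints at the answer to the parenthetical question.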


Example 2.5.3

Electronics Warehouse is responding to affirmative-action litigation by establishing
hiring goals by race and sex for its office staff. So far they have agreed to employ
the 120 people characterized in Table 2.5.1. How many black women do they need
in order for the events A: Employee is female and B: Employee is black to be
independent?

Let x denote the number of black women necessary for A and B to be
independent. Then

\[ P(A \cap B) = P(\text{black female}) = \frac{x}{120 + x} \]

must equal

\[ P(A) \cdot P(B) = P(\text{female}) \cdot P(\text{black}) = \frac{40 + x}{120 + x} \cdot \frac{30 + x}{120 + x} \]

Setting x/(120 + x) = [(40 + x)/(120 + x)] · [(30 + x)/(120 + x)] gives
x(120 + x) = (40 + x)(30 + x), that is, 50x = 1200, which implies that x = 24
black women need to be on the staff in order for A and B to be independent.


Table 2.5.1

           White    Black
Male         50       30
Female       40        x


Comment Having shown that “Employee is female” and “Employee is black” are
independent, does it follow that, say, “Employee is male” and “Employee is white”
are independent? Yes. By virtue of the derivation in Example 2.5.2, the independence
of events A and B implies the independence of events $A^C$ and $B^C$ (as well as
A and $B^C$, and $A^C$ and B). It follows, then, that having x = 24 black women not only
makes A and B independent, it also implies, more generally, that “race” and “sex”
are independent.
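As a rough numerical cross-check of the staffing example, the value of x can also be found by direct search. The sketch below is illustrative only; the function name `is_independent` and the search range 0 to 200 are assumptions made here.

```python
from fractions import Fraction

def is_independent(white_male, black_male, white_female, black_female):
    """Check Definition 2.5.1 for A = female and B = black in a 2x2 staff table."""
    total = white_male + black_male + white_female + black_female
    p_female = Fraction(white_female + black_female, total)
    p_black = Fraction(black_male + black_female, total)
    p_both = Fraction(black_female, total)
    return p_both == p_female * p_black

# Search for every x (number of black women) that makes A and B independent.
solutions = [x for x in range(0, 201) if is_independent(50, 30, 40, x)]
print(solutions)

# With x = 24, the complementary pair "male" and "white" also factors.
total = 50 + 30 + 40 + 24
p_male_and_white = Fraction(50, total)
p_male = Fraction(50 + 30, total)
p_white = Fraction(50 + 40, total)
print(p_male_and_white == p_male * p_white)
```

The second check mirrors the Comment: once A and B factor, so do their complements.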


Example 2.5.4

Suppose that two events, A and B, each having nonzero probability, are mutually
exclusive. Are they also independent?

No. If A and B are mutually exclusive, then $P(A \cap B) = 0$. But $P(A) \cdot P(B) > 0$
(by assumption), so the equality spelled out in Definition 2.5.1 that characterizes
independence is not met.


Deducing Independence 


Sometimes the physical circumstances surrounding two events make it obvious that 
the occurrence (or nonoccurrence) of one has absolutely no influence or effect on 
the occurrence (or nonoccurrence) of the other. If that should be the case, then the 
two events will necessarily be independent in the sense of Definition 2.5.1. 

Suppose a coin is tossed twice. Clearly, whatever happens on the first toss has 
no physical connection or influence on the outcome of the second. If A and B, then, 
are events defined on the second and first tosses, respectively, it would have to be 
the case that P(A|B) = P(A|B°) = P(A). For example, let A be the event that the 
second toss of a fair coin is a head, and let B be the event that the first toss of that 
coin is a tail. Then 


\[ P(A \mid B) = P(\text{head on second toss} \mid \text{tail on first toss}) = P(\text{head on second toss}) = \frac{1}{2} \]


Being able to infer that certain events are independent proves to be of enormous
help in solving certain problems. The reason is that many events of interest
are, in fact, intersections. If those events are independent, then the probability of
that intersection reduces to a simple product (because of Definition 2.5.1), that is,
$P(A \cap B) = P(A) \cdot P(B)$. For the coin tosses just described,

\[
\begin{aligned}
P(A \cap B) &= P(\text{head on second toss} \cap \text{tail on first toss}) \\
&= P(A) \cdot P(B) \\
&= P(\text{head on second toss}) \cdot P(\text{tail on first toss}) \\
&= \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}
\end{aligned}
\]


Example 2.5.5

Myra and Carlos are summer interns working as proofreaders for a local newspaper.
Based on aptitude tests, Myra has a 50% chance of spotting a hyphenation error,
while Carlos picks up on that same kind of mistake 80% of the time. Suppose the
copy they are proofing contains a hyphenation error. What is the probability it goes
undetected?

Let A and B be the events that Myra and Carlos, respectively, catch the mistake.
By assumption, P(A) = 0.50 and P(B) = 0.80. What we are looking for is the
probability of the complement of a union. That is,

\[
\begin{aligned}
P(\text{Error goes undetected}) &= 1 - P(\text{Error is detected}) \\
&= 1 - P(\text{Myra or Carlos or both see the mistake}) \\
&= 1 - P(A \cup B) \\
&= 1 - \{P(A) + P(B) - P(A \cap B)\} \quad \text{(from Theorem 2.3.6)}
\end{aligned}
\]

Since proofreaders invariably work by themselves, events A and B are necessarily
independent, so $P(A \cap B)$ would reduce to the product $P(A) \cdot P(B)$. It follows that
such an error would go unnoticed 10% of the time:

\[ P(\text{Error goes undetected}) = 1 - \{0.50 + 0.80 - (0.50)(0.80)\} = 1 - 0.90 = 0.10 \]
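The proofreading calculation can be reproduced two ways, which makes a useful consistency check. This is a minimal sketch, assuming only the two detection probabilities given in the example; the variable names are illustrative.

```python
from fractions import Fraction

p_myra = Fraction(1, 2)     # P(A): Myra catches the error
p_carlos = Fraction(4, 5)   # P(B): Carlos catches the error

# Route 1: complement of the union, with P(A and B) = P(A) * P(B).
p_detected = p_myra + p_carlos - p_myra * p_carlos
p_missed_via_union = 1 - p_detected

# Route 2: both miss, using the independence of the complements
# (the result proved in Example 2.5.2).
p_missed_via_complements = (1 - p_myra) * (1 - p_carlos)

print(p_missed_via_union, p_missed_via_complements)
```

Both routes give the same value, as the derivation in Example 2.5.2 says they must.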


Example 2.5.6

Suppose that one of the genes associated with the control of carbohydrate
metabolism exhibits two alleles: a dominant W and a recessive w. If the probabilities
of the WW, Ww, and ww genotypes in the present generation are p, q, and r,
respectively, for both males and females, what are the chances that an individual in
the next generation will be a ww?

Let A denote the event that an offspring receives a w allele from her father; let
B denote the event that she receives the recessive allele from her mother. What we
are looking for is $P(A \cap B)$.

According to the information given,

\[
\begin{aligned}
p &= P(\text{Parent has genotype } WW) = P(WW) \\
q &= P(\text{Parent has genotype } Ww) = P(Ww) \\
r &= P(\text{Parent has genotype } ww) = P(ww)
\end{aligned}
\]

If an offspring is equally likely to receive either of her parent’s alleles, the
probabilities of A and B can be computed using Theorem 2.4.1:

\[
\begin{aligned}
P(A) &= P(A \mid WW)P(WW) + P(A \mid Ww)P(Ww) + P(A \mid ww)P(ww) \\
&= 0 \cdot p + \frac{1}{2} \cdot q + 1 \cdot r \\
&= r + \frac{q}{2} = P(B)
\end{aligned}
\]

Lacking any evidence to the contrary, there is every reason here to assume that A
and B are independent events, in which case

\[
\begin{aligned}
P(A \cap B) &= P(\text{Offspring has genotype } ww) \\
&= P(A) \cdot P(B) \\
&= \left(r + \frac{q}{2}\right)^2
\end{aligned}
\]

(This particular model for allele segregation, together with the independence
assumption, is called random Mendelian mating.)
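The two-step calculation (total probability for each parent, then a product by independence) can be written as a short function. This is a sketch under the example's assumptions; the function name and the sample genotype frequencies are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def prob_ww_offspring(p, q, r):
    """P(offspring is ww) under random Mendelian mating.

    p, q, r are the parental probabilities of genotypes WW, Ww, and ww.
    """
    if p + q + r != 1:
        raise ValueError("genotype probabilities must sum to 1")
    # Theorem 2.4.1: a WW parent never passes w, a Ww parent does so half
    # the time, and a ww parent always does.
    p_w_from_parent = 0 * p + Fraction(1, 2) * q + 1 * r
    # Independence of the two parents' contributions gives a product.
    return p_w_from_parent ** 2

# Hypothetical genotype frequencies (not from the text).
example = prob_ww_offspring(Fraction(1, 4), Fraction(1, 2), Fraction(1, 4))
print(example)
```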


Emma and Josh have just gotten engaged. What is the probability that they have 
different blood types? Assume that blood types for both men and women are 
distributed in the general population according to the following proportions: 


Blood Type Proportion 


A 40% 
B 10% 
AB 5% 
O 45% 


First, note that the event “Emma and Josh have different blood types” includes 
more possibilities than does the event “Emma and Josh have the same blood type.” 
That being the case, the complement will be easier to work with than the question 
originally posed. We can start, then, by writing 


\[ P(\text{Emma and Josh have different blood types}) = 1 - P(\text{Emma and Josh have the same blood type}) \]

Now, if we let $E_X$ and $J_X$ represent the events that Emma and Josh, respectively,
have blood type X, then the event “Emma and Josh have the same blood type” is a
union of intersections, and we can write

\[ P(\text{Emma and Josh have the same blood type}) = P\{(E_A \cap J_A) \cup (E_B \cap J_B) \cup (E_{AB} \cap J_{AB}) \cup (E_O \cap J_O)\} \]

Since the four intersections here are mutually exclusive, the probability of their
union becomes the sum of their probabilities. Moreover, “blood type” is not a
factor in the selection of a spouse, so $E_X$ and $J_X$ are independent events and

$P(E_X \cap J_X) = P(E_X)P(J_X)$. It follows, then, that Emma and Josh have a 62.5%
chance of having different blood types:

\[
\begin{aligned}
P(\text{Emma and Josh have different blood types}) &= 1 - \{P(E_A)P(J_A) + P(E_B)P(J_B) + P(E_{AB})P(J_{AB}) + P(E_O)P(J_O)\} \\
&= 1 - \{(0.40)(0.40) + (0.10)(0.10) + (0.05)(0.05) + (0.45)(0.45)\} \\
&= 0.625
\end{aligned}
\]


Questions


2.5.1. Suppose that $P(A \cap B) = 0.2$, P(A) = 0.6, and
P(B) = 0.5.

(a) Are A and B mutually exclusive?
(b) Are A and B independent?
(c) Find $P(A^C \cup B^C)$.


2.5.2. Spike is not a terribly bright student. His chances
of passing chemistry are 0.35; mathematics, 0.40; and both,
0.12. Are the events “Spike passes chemistry” and “Spike
passes mathematics” independent? What is the probability
that he fails both subjects?


2.5.3. Two fair dice are rolled. What is the probability
that the number showing on one will be twice the number
appearing on the other?


2.5.4. Urn I has three red chips, two black chips, and five
white chips; urn II has two red, four black, and three white.
One chip is drawn at random from each urn. What is the
probability that both chips are the same color?


2.5.5. Dana and Cathy are playing tennis. The probability
that Dana wins at least one out of two games is 0.3. What
is the probability that Dana wins at least one out of four?


2.5.6. Three points, $X_1$, $X_2$, and $X_3$, are chosen at random
in the interval (0, a). A second set of three points, $Y_1$, $Y_2$,
and $Y_3$, are chosen at random in the interval (0, b). Let A
be the event that $X_2$ is between $X_1$ and $X_3$. Let B be the
event that $Y_1 < Y_2 < Y_3$. Find $P(A \cap B)$.


2.5.7. Suppose that P(A) = + and P(B) =} 


a" 


(a) What does P(A U B) equal if 
1. A and B are mutually exclusive? 
2. A and B are independent? 

(b) What does P(A | B) equal if 
1. A and B are mutually exclusive? 
2. A and B are independent? 


2.5.8. Suppose that events A, B, and C are independent. 


(a) Use a Venn diagram to find an expression for
$P(A \cup B \cup C)$ that does not make use of a complement.

(b) Find an expression for $P(A \cup B \cup C)$ that does make
use of a complement.


2.5.9. A fair coin is tossed four times. What is the proba- 
bility that the number of heads appearing on the first two 
tosses is equal to the number of heads appearing on the 
second two tosses? 


2.5.10. Suppose that two cards are drawn simultaneously 
from a standard 52-card poker deck. Let A be the event 
that both are either a jack, queen, king, or ace of hearts, 
and let B be the event that both are aces. Are A and B 
independent? (Note: There are 1326 equally likely ways to 
draw two cards from a poker deck.) 


Defining the Independence of More Than Two Events 


It is not immediately obvious how to extend Definition 2.5.1 to, say, three events. To 
call A, B, and C independent, should we require that the probability of the three-way 
intersection factors into the product of the three original probabilities, 


\[ P(A \cap B \cap C) = P(A) \cdot P(B) \cdot P(C) \tag{2.5.3} \]

or should we impose the definition we already have on the three pairs of events:

\[
\begin{aligned}
P(A \cap B) &= P(A) \cdot P(B) \\
P(B \cap C) &= P(B) \cdot P(C) \\
P(A \cap C) &= P(A) \cdot P(C)
\end{aligned}
\tag{2.5.4}
\]


Actually, neither condition by itself is sufficient. If three events satisfy Equa- 
tions 2.5.3 and 2.5.4, we will call them independent (or mutually independent), 
but Equation 2.5.3 does not imply Equation 2.5.4, nor does Equation 2.5.4 imply 
Equation 2.5.3 (see Questions 2.5.11 and 2.5.12). 

More generally, the independence of n events requires that the probabilities
of all possible intersections equal the products of all the corresponding individual
probabilities. Definition 2.5.2 states the result formally. Analogous to what was true
in the case of two events, the practical applications of Definition 2.5.2 arise when
n events are mutually independent, and we can calculate
$P(A_1 \cap A_2 \cap \cdots \cap A_n)$ by computing the product
$P(A_1) \cdot P(A_2) \cdots P(A_n)$.


Definition 2.5.2. Events $A_1, A_2, \ldots, A_n$ are said to be independent if for every
set of indices $i_1, i_2, \ldots, i_k$ between 1 and n, inclusive,

\[ P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) \cdot P(A_{i_2}) \cdots P(A_{i_k}) \]
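Definition 2.5.2 can be tested mechanically on a finite sample space by checking every sub-collection of two or more events. The sketch below is one possible way to do that, not a prescribed method; the helper names are assumptions, and the three illustrative events are the ones described in Question 2.5.15.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space: ordered outcomes (red, green) for two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of a set of equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

def independent(events):
    """Definition 2.5.2: every sub-collection of two or more events must factor."""
    for k in range(2, len(events) + 1):
        for subset in combinations(events, k):
            intersection = set(outcomes).intersection(*subset)
            product_of_probs = Fraction(1)
            for event in subset:
                product_of_probs *= prob(event)
            if prob(intersection) != product_of_probs:
                return False
    return True

a = {o for o in outcomes if o[0] % 2 == 1}            # red die is odd
b = {o for o in outcomes if o[1] % 2 == 1}            # green die is odd
c = {o for o in outcomes if (o[0] + o[1]) % 2 == 1}   # sum is odd

pairwise = all(independent(list(pair)) for pair in combinations([a, b, c], 2))
mutual = independent([a, b, c])
print(pairwise, mutual)
```

Checking only the pairs, or only the full intersection, would miss cases like this one, which is why the definition quantifies over every set of indices.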


An insurance company plans to assess its future liabilities by sampling the records 
of its current policyholders. A pilot study has turned up three clients—one living in 
Alaska, one in Missouri, and one in Vermont— whose estimated chances of surviving 
to the year 2015 are 0.7, 0.9, and 0.3, respectively. What is the probability that by the 
end of 2014 the company will have had to pay death benefits to exactly one of the 
three? 

Let $A_1$ be the event “Alaska client survives through 2014.” Define $A_2$ and $A_3$
analogously for the Missouri client and Vermont client, respectively. Then the event
E: “Exactly one dies” can be written as the union of three intersections:

\[ E = (A_1 \cap A_2 \cap A_3^C) \cup (A_1 \cap A_2^C \cap A_3) \cup (A_1^C \cap A_2 \cap A_3) \]

Since each of the intersections is mutually exclusive of the other two,

\[ P(E) = P(A_1 \cap A_2 \cap A_3^C) + P(A_1 \cap A_2^C \cap A_3) + P(A_1^C \cap A_2 \cap A_3) \]

Furthermore, for all practical purposes there is no reason to believe that the fates
of the three are not independent. That being the case, each of the intersection
probabilities reduces to a product, and we can write

\[
\begin{aligned}
P(E) &= P(A_1) \cdot P(A_2) \cdot P(A_3^C) + P(A_1) \cdot P(A_2^C) \cdot P(A_3) + P(A_1^C) \cdot P(A_2) \cdot P(A_3) \\
&= (0.7)(0.9)(0.7) + (0.7)(0.1)(0.3) + (0.3)(0.9)(0.3) \\
&= 0.543
\end{aligned}
\]
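The union-of-intersections argument generalizes to "exactly k deaths" by enumerating all survive/die patterns. The following is a minimal sketch assuming the three survival probabilities from the example and independence of the clients; the function name is illustrative.

```python
from itertools import product

# Survival probabilities through 2014: Alaska, Missouri, Vermont clients.
p_survive = [0.7, 0.9, 0.3]

def prob_exactly_k_deaths(k):
    """Sum the probabilities of all survive/die patterns with exactly k deaths."""
    total = 0.0
    for pattern in product([True, False], repeat=len(p_survive)):
        if pattern.count(False) != k:
            continue
        # Independence turns each intersection into a product.
        term = 1.0
        for survives, p in zip(pattern, p_survive):
            term *= p if survives else (1 - p)
        total += term
    return total

p_one_death = prob_exactly_k_deaths(1)
print(round(p_one_death, 3))
```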


Comment “Declaring” events independent for reasons other than those prescribed
in Definition 2.5.2 is a necessarily subjective endeavor. Here we might feel fairly
certain that a “random” person dying in Alaska will not affect the survival chances
of a “random” person residing in Missouri (or Vermont). But there may be special
circumstances that invalidate that sort of argument. For example, what if the three
individuals in question were mercenaries fighting in an African border war and were
all crew members assigned to the same helicopter? In practice, all we can do is look
at each situation on an individual basis and try to make a reasonable judgment as to
whether the occurrence of one event is likely to influence the outcome of another
event.


Protocol for making financial decisions in a certain corporation follows the “circuit” 
pictured in Figure 2.5.1. Any budget is first screened by 1. If he approves it, the plan 
is forwarded to 2, 3, and 5. If either 2 or 3 concurs, it goes to 4. If either 4 or 5 
says “yes,” it moves on to 6 for a final reading. Only if 6 is also in agreement does 
the proposal pass. Suppose that 1, 5, and 6 each has a 50% chance of saying “yes,” 
whereas 2, 3, and 4 will each concur with a probability of 0.70. If everyone comes to 
a decision independently, what is the probability that a budget will pass? 


Figure 2.5.1 [circuit diagram: 1 leads to two parallel paths, (2 or 3) followed by 4, and 5 alone; both paths lead to 6]


Probabilities of this sort are calculated by reducing the circuit to its component 
unions and intersections. Moreover, if all decisions are made independently, which 
is the case here, then every intersection becomes a product. 

Let $A_i$ be the event that person i approves the budget, i = 1, 2, ..., 6. Looking
at Figure 2.5.1, we see that

\[
\begin{aligned}
P(\text{Budget passes}) &= P(A_1 \cap \{[(A_2 \cup A_3) \cap A_4] \cup A_5\} \cap A_6) \\
&= P(A_1) \, P\{[(A_2 \cup A_3) \cap A_4] \cup A_5\} \, P(A_6)
\end{aligned}
\]

By assumption, $P(A_1) = 0.5$, $P(A_2) = 0.7$, $P(A_3) = 0.7$, $P(A_4) = 0.7$, $P(A_5) = 0.5$, and
$P(A_6) = 0.5$, so

\[
\begin{aligned}
P\{(A_2 \cup A_3) \cap A_4\} &= [P(A_2) + P(A_3) - P(A_2)P(A_3)] \, P(A_4) \\
&= [0.7 + 0.7 - (0.7)(0.7)](0.7) \\
&= 0.637
\end{aligned}
\]

Therefore,

\[
\begin{aligned}
P(\text{Budget passes}) &= (0.5)\{0.637 + 0.5 - (0.637)(0.5)\}(0.5) \\
&= 0.205
\end{aligned}
\]
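The reduction of the circuit to unions and intersections can be mirrored in code with two small helpers. This is a sketch, valid only because each person appears exactly once in the circuit, so the sub-events being combined are themselves independent; the helper names `p_or` and `p_and` are assumptions.

```python
def p_or(p, q):
    """P(A or B) for independent events A and B."""
    return p + q - p * q

def p_and(p, q):
    """P(A and B) for independent events A and B."""
    return p * q

# Approval probabilities for persons 1 through 6 (from the example).
p = {1: 0.5, 2: 0.7, 3: 0.7, 4: 0.7, 5: 0.5, 6: 0.5}

upper_branch = p_and(p_or(p[2], p[3]), p[4])   # (2 or 3), then 4
middle = p_or(upper_branch, p[5])              # upper branch or 5
p_pass = p_and(p_and(p[1], middle), p[6])      # 1, then middle, then 6

print(round(upper_branch, 3), round(p_pass, 3))
```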


Repeated Independent Events


We have already seen several examples where the event of interest was actually 
an intersection of independent simpler events (in which case the probability of the 
intersection reduced to a product). There is a special case of that basic scenario that 
deserves special mention because it applies to numerous real-world situations. If the 
events making up the intersection all arise from the same physical circumstances 
and assumptions (i.e., they represent repetitions of the same experiment), they are 
referred to as repeated independent trials. The number of such trials may be finite or 
infinite. 


Suppose the string of Christmas tree lights you just bought has twenty-four bulbs 
wired in series. If each bulb has a 99.9% chance of “working” the first time current 
is applied, what is the probability that the string itself will not work? 

Let $A_i$ be the event that the ith bulb fails, i = 1, 2, ..., 24. Then

\[
\begin{aligned}
P(\text{String fails}) &= P(\text{At least one bulb fails}) \\
&= P(A_1 \cup A_2 \cup \cdots \cup A_{24}) \\
&= 1 - P(\text{String works}) \\
&= 1 - P(\text{All twenty-four bulbs work}) \\
&= 1 - P(A_1^C \cap A_2^C \cap \cdots \cap A_{24}^C)
\end{aligned}
\]

If we assume that bulb failures are independent events,

\[ P(\text{String fails}) = 1 - P(A_1^C) P(A_2^C) \cdots P(A_{24}^C) \]

Moreover, since all the bulbs are presumably manufactured the same way, $P(A_i^C)$ is
the same for all i, so

\[
\begin{aligned}
P(\text{String fails}) &= 1 - \{P(A_i^C)\}^{24} \\
&= 1 - (0.999)^{24} \\
&= 1 - 0.98 \\
&= 0.02
\end{aligned}
\]

The chances are one in fifty, in other words, that the string will not work the first
time current is applied.
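The same "one minus the probability that all work" pattern applies to any series system of identical, independent components. The sketch below is illustrative; the function name is an assumption. Carried to more decimal places, the failure probability is about 0.024, which the text rounds to 0.02.

```python
def prob_series_fails(p_work, n):
    """P(at least one of n independent components fails) = 1 - p_work**n."""
    return 1 - p_work ** n

# Twenty-four bulbs, each working with probability 0.999.
p_fail = prob_series_fails(0.999, 24)
print(round(p_fail, 4))
```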


During the 1978 baseball season, Pete Rose of the Cincinnati Reds set a National 
League record by hitting safely in forty-four consecutive games. Assume that Rose 
was a .300 hitter and that he came to bat four times each game. If each at-bat 
is assumed to be an independent event, what probability might reasonably be 
associated with a hitting streak of that length? 

For this problem we need to invoke the repeated independent trials model
twice: once for the four at-bats making up a game and a second time for the
forty-four games making up the streak. Let $A_i$ denote the event “Rose hit safely in ith
game,” i = 1, 2, ..., 44. Then

\[
\begin{aligned}
P(\text{Rose hit safely in forty-four consecutive games}) &= P(A_1 \cap A_2 \cap \cdots \cap A_{44}) \\
&= P(A_1) \cdot P(A_2) \cdots P(A_{44})
\end{aligned}
\tag{2.5.5}
\]

Since all the $P(A_i)$’s are equal, we can further simplify Equation 2.5.5 by writing

\[ P(\text{Rose hit safely in forty-four consecutive games}) = [P(A_1)]^{44} \]

To calculate $P(A_1)$ we should focus on the complement of $A_1$. Specifically,

\[
\begin{aligned}
P(A_1) &= 1 - P(A_1^C) \\
&= 1 - P(\text{Rose did not hit safely in Game 1}) \\
&= 1 - P(\text{Rose made four outs}) \\
&= 1 - (0.700)^4 \quad \text{(Why?)} \\
&= 0.76
\end{aligned}
\]

Therefore, the probability of a .300 hitter putting together a forty-four-game streak
(during a given set of forty-four games) is 0.0000057:

\[ P(\text{Rose hit safely in forty-four consecutive games}) = (0.76)^{44} = 0.0000057 \]
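The double use of the repeated independent trials model can be sketched as two nested functions. The function names are assumptions; the inputs are the ones stated in the example. Because the code does not round P(A_1) to 0.76 before raising it to the forty-fourth power, its result differs slightly from the text's figure in the second significant digit.

```python
def prob_hit_in_game(batting_avg, at_bats):
    """P(at least one hit in a game), assuming independent at-bats."""
    return 1 - (1 - batting_avg) ** at_bats

def prob_streak(batting_avg, at_bats, games):
    """P(a hit in every one of `games` specified consecutive games)."""
    return prob_hit_in_game(batting_avg, at_bats) ** games

p_game = prob_hit_in_game(0.300, 4)
p_streak = prob_streak(0.300, 4, 44)
print(round(p_game, 4), p_streak)
```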


Comment The analysis described here has the basic “structure” of a repeated
independent trials problem, but the assumptions that the latter makes are not entirely
satisfied by the data. Each at-bat, for example, is not really a repetition of the same
experiment, nor is $P(A_i)$ the same for all i. Rose would obviously have had different
probabilities of getting a hit against different pitchers. Moreover, although “four”
was probably the typical number of official at-bats that he had during a game, there
would certainly have been many instances where he had either fewer or more.
Modest deviations from game to game, though, would not have had a major effect on the
probability associated with Rose’s forty-four-game streak.


In the game of craps, one of the ways a player can win is by rolling (with two dice) 
one of the sums 4, 5, 6, 8, 9, or 10, and then rolling that sum again before rolling a 
sum of 7. For example, the sequence of sums 6, 5, 8, 8, 6 would result in the player 
winning on his fifth roll. In gambling parlance, “6” is the player’s “point,” and he 
“made his point.” On the other hand, the sequence of sums 8, 4, 10, 7 would result 
in the player losing on his fourth roll: his point was an 8, but he rolled a sum of 7 
before he rolled a second 8. What is the probability that a player wins with a point 
of 10? 


Table 2.5.2

Sequence of Rolls                     Probability
(10, 10)                              (3/36)(3/36)
(10, no 10 or 7, 10)                  (3/36)(27/36)(3/36)
(10, no 10 or 7, no 10 or 7, 10)      (3/36)(27/36)(27/36)(3/36)


Table 2.5.2 shows some of the ways a player can make a point of 10. Each
sequence, of course, is an intersection of independent events, so its probability
becomes a product. The event “Player wins with a point of 10” is then the union
of all the sequences that could have been listed in the first column. Since all those
sequences are mutually exclusive, the probability of winning with a point of 10
reduces to the sum of an infinite number of products:

\[
\begin{aligned}
P(\text{Player wins with a point of 10}) &= \frac{3}{36} \cdot \frac{3}{36} + \frac{3}{36} \cdot \frac{27}{36} \cdot \frac{3}{36} + \frac{3}{36} \cdot \frac{27}{36} \cdot \frac{27}{36} \cdot \frac{3}{36} + \cdots \\
&= \frac{3}{36} \cdot \frac{3}{36} \sum_{k=0}^{\infty} \left(\frac{27}{36}\right)^k
\end{aligned}
\tag{2.5.6}
\]

Recall from algebra that if 0 < r < 1,

\[ \sum_{k=0}^{\infty} r^k = \frac{1}{1 - r} \]

Applying the formula for the sum of a geometric series to Equation 2.5.6 shows that
the probability of winning at craps with a point of 10 is 1/36:

\[ P(\text{Player wins with a point of 10}) = \frac{3}{36} \cdot \frac{3}{36} \cdot \frac{1}{1 - \frac{27}{36}} = \frac{1}{36} \]

Table 2.5.3

Point    P(makes point)
  4         1/36
  5         16/360
  6         25/396
  8         25/396
  9         16/360
 10         1/36


Table 2.5.3 shows the probabilities of a person “making” each of the possible six
points: 4, 5, 6, 8, 9, and 10. According to the rules of craps, a player wins by either
(1) getting a sum of 7 or 11 on the first roll or (2) getting a 4, 5, 6, 8, 9, or 10 on the
first roll and making the point. But P(sum = 7) = 6/36 and P(sum = 11) = 2/36, so

\[
\begin{aligned}
P(\text{Player wins}) &= \frac{6}{36} + \frac{2}{36} + \frac{1}{36} + \frac{16}{360} + \frac{25}{396} + \frac{25}{396} + \frac{16}{360} + \frac{1}{36} \\
&= 0.493
\end{aligned}
\]

As even-money games go, craps is relatively fair: the probability of the shooter
winning is not much less than 0.500.
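The entries of Table 2.5.3 and the overall winning probability can be recomputed exactly from the geometric-series argument. This is a sketch using exact fractions; the function names are assumptions introduced for illustration.

```python
from fractions import Fraction

# Number of ways to roll each sum with two fair dice.
ways = {s: 0 for s in range(2, 13)}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        ways[d1 + d2] += 1

def prob_sum(s):
    """P(the two dice total s) on a single roll."""
    return Fraction(ways[s], 36)

def prob_roll_and_make_point(point):
    """P(first roll is `point` and the point is rolled again before a 7)."""
    p_point, p_seven = prob_sum(point), prob_sum(7)
    p_neither = 1 - p_point - p_seven
    # Geometric series: p_point * p_point * sum over k of p_neither**k.
    return p_point * p_point / (1 - p_neither)

p_win = prob_sum(7) + prob_sum(11) + sum(
    prob_roll_and_make_point(pt) for pt in (4, 5, 6, 8, 9, 10)
)
print(p_win, float(p_win))
```

Working in fractions avoids accumulating rounding error across the eight terms.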

Example 2.5.13

A transmitter is sending a binary code (+ and - signals) that must pass through
three relay stations before being sent on to the receiver (see Figure 2.5.2). At each
relay station, there is a 25% chance that the signal will be reversed, that is,

\[
\begin{aligned}
P(+ \text{ is sent by relay } i \mid - \text{ is received by relay } i) &= P(- \text{ is sent by relay } i \mid + \text{ is received by relay } i) \\
&= 1/4, \quad i = 1, 2, 3
\end{aligned}
\]

Suppose + symbols make up 60% of the message being sent. If the signal + is
received, what is the probability a + was sent?


Figure 2.5.2 [diagram: a signal (+ ?) leaves the tower, passes through relays 1, 2, and 3, and reaches the receiver as (+)]


This is basically a Bayes’ Theorem (Theorem 2.4.2) problem, but the three relay
stations introduce a more complex mechanism for transmission error. Let A be the
event “+ is transmitted from tower” and B be the event “+ is received from relay 3.”
Then

\[ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B \mid A) P(A) + P(B \mid A^C) P(A^C)} \]


Notice that a + can be received from relay 3 given that a + was initially sent from
the tower if either (1) all relay stations function properly or (2) any two of the
stations make transmission errors. Table 2.5.4 shows the four mutually exclusive ways
(1) and (2) can happen. The probabilities associated with the message transmissions
at each relay station are shown in parentheses. Assuming the relay station outputs
are independent events, the probability of an entire transmission sequence is simply
the product of the probabilities in parentheses in any given row. These overall
probabilities are listed in the last column; their sum, 36/64, is P(B|A). By a similar
analysis, we can show that

\[ P(B \mid A^C) = P(+ \text{ is received from relay 3} \mid - \text{ is transmitted from tower}) = 28/64 \]

Finally, since P(A) = 0.6, and therefore $P(A^C) = 0.4$, the conditional probability we
are looking for is 0.66:

\[ P(A \mid B) = \frac{\left(\frac{36}{64}\right)(0.6)}{\left(\frac{36}{64}\right)(0.6) + \left(\frac{28}{64}\right)(0.4)} = 0.66 \]
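The row-by-row enumeration behind P(B|A) and P(B|A^C) can be automated by looping over every pattern of relay reversals. The sketch below assumes, as the example does, that the relays act independently; the constant and function names are illustrative.

```python
from fractions import Fraction
from itertools import product

P_FLIP = Fraction(1, 4)        # each relay reverses the signal with probability 1/4
P_PLUS_SENT = Fraction(3, 5)   # P(A): the tower transmits a +

def prob_plus_received(sent_plus, n_relays=3):
    """P(+ leaves the last relay | tower signal), summed over all flip patterns."""
    total = Fraction(0)
    for flips in product([False, True], repeat=n_relays):
        signal_is_plus = sent_plus
        weight = Fraction(1)
        for flipped in flips:
            weight *= P_FLIP if flipped else 1 - P_FLIP
            if flipped:
                signal_is_plus = not signal_is_plus
        if signal_is_plus:
            total += weight
    return total

p_b_given_a = prob_plus_received(True)
p_b_given_not_a = prob_plus_received(False)

# Bayes' Theorem (Theorem 2.4.2).
posterior = (p_b_given_a * P_PLUS_SENT) / (
    p_b_given_a * P_PLUS_SENT + p_b_given_not_a * (1 - P_PLUS_SENT)
)
print(p_b_given_a, p_b_given_not_a, float(posterior))
```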


Table 2.5.4

                Signal transmitted by
Tower    Relay 1    Relay 2    Relay 3    Probability
  +      +(3/4)     -(1/4)     +(1/4)        3/64
  +      -(1/4)     -(3/4)     +(1/4)        3/64
  +      -(1/4)     +(1/4)     +(3/4)        3/64
  +      +(3/4)     +(3/4)     +(3/4)       27/64
                                            -----
                                            36/64


Andy, Bob, and Charley have gotten into a disagreement over a female acquaintance,
Donna, and decide to settle their dispute with a three-cornered pistol duel.
Of the three, Andy is the worst shot, hitting his target only 30% of the time. Charley, 
a little better, is on-target 50% of the time, while Bob never misses (see Figure 2.5.3). 
The rules they agree to are simple: They are to fire at the targets of their choice in 
succession, and cyclically, in the order Andy, Bob, Charley, and so on, until only one 
of them is left standing. On each “turn,” they get only one shot. If a combatant is 
hit, he no longer participates, either as a target or as a shooter. 


Figure 2.5.3 [P(hits target): Andy 0.3, Bob 1.0, Charley 0.5]


As Andy loads his revolver, he mulls over his options (his objective is clear—to max- 
imize his probability of survival). According to the rule he can shoot either Bob or 
Charley, but he quickly rules out shooting at the latter because it would be counter- 
productive to his future well-being: If he shot at Charley and had the misfortune of 
hitting him, it would be Bob’s turn, and Bob would have no recourse but to shoot 
at Andy. From Andy’s point of view, this would be a decidedly grim turn of events, 
since Bob never misses. Clearly, Andy’s only option is to shoot at Bob. This leaves 
two scenarios: (1) He shoots at Bob and hits him, or (2) he shoots at Bob and misses. 

Consider the first possibility. If Andy hits Bob, Charley will proceed to shoot
at Andy, Andy will shoot back at Charley, and so on, until one of them hits the
other. Let $CH_i$ and $CM_i$ denote the events “Charley hits Andy with the ith shot”
and “Charley misses Andy with the ith shot,” respectively. Define $AH_i$ and $AM_i$
analogously. Then Andy’s chances of survival (given that he has killed Bob) reduce
to a countably infinite union of intersections:

\[
\begin{aligned}
P(\text{Andy survives}) = P[&(CM_1 \cap AH_1) \cup (CM_1 \cap AM_1 \cap CM_2 \cap AH_2) \\
&\cup (CM_1 \cap AM_1 \cap CM_2 \cap AM_2 \cap CM_3 \cap AH_3) \cup \cdots]
\end{aligned}
\]

Note that each intersection is mutually exclusive of all of the others and that its
component events are independent. Therefore,

\[
\begin{aligned}
P(\text{Andy survives}) &= P(CM_1)P(AH_1) + P(CM_1)P(AM_1)P(CM_2)P(AH_2) \\
&\quad + P(CM_1)P(AM_1)P(CM_2)P(AM_2)P(CM_3)P(AH_3) + \cdots \\
&= (0.5)(0.3) + (0.5)(0.7)(0.5)(0.3) + (0.5)(0.7)(0.5)(0.7)(0.5)(0.3) + \cdots \\
&= (0.5)(0.3) \sum_{k=0}^{\infty} (0.35)^k \\
&= (0.15) \left( \frac{1}{1 - 0.35} \right) \\
&= \frac{3}{13}
\end{aligned}
\]


Now consider the second scenario. If Andy shoots at Bob and misses, Bob will 
undoubtedly shoot and hit Charley, since Charley is the more dangerous adversary. 
Then it will be Andy’s turn again. Whether he would see another tomorrow would 
depend on his ability to make that very next shot count. Specifically, 


\[ P(\text{Andy survives}) = P(\text{Andy hits Bob on second turn}) = 3/10 \]

But 3/10 > 3/13, so Andy is better off not hitting Bob with his first shot. And because
we have already argued that it would be foolhardy for Andy to shoot at Charley,
Andy’s optimal strategy is clear: deliberately miss both Bob and Charley with the
first shot.
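The comparison of Andy's two scenarios can be checked numerically, both with the closed-form geometric sum and with a truncated version of the series. This is a sketch; the names and the 200-round truncation are assumptions chosen for illustration.

```python
from fractions import Fraction

P_ANDY = Fraction(3, 10)     # Andy's hit probability
P_CHARLEY = Fraction(1, 2)   # Charley's hit probability

def andy_beats_charley(max_rounds=200):
    """Scenario 1: Bob is gone and Charley shoots first; partial sum of the series."""
    total = Fraction(0)
    both_missed_so_far = Fraction(1)
    for _ in range(max_rounds):
        # This round: Charley misses, then Andy hits.
        total += both_missed_so_far * (1 - P_CHARLEY) * P_ANDY
        # Otherwise both miss and the exchange repeats.
        both_missed_so_far *= (1 - P_CHARLEY) * (1 - P_ANDY)
    return total

# Closed form of the same geometric series.
scenario_1 = (1 - P_CHARLEY) * P_ANDY / (1 - (1 - P_CHARLEY) * (1 - P_ANDY))
# Scenario 2: Andy misses Bob, Bob eliminates Charley, Andy gets one shot at Bob.
scenario_2 = P_ANDY

print(scenario_1, scenario_2, scenario_2 > scenario_1)
```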


Questions 


2.5.11. Suppose that two fair dice (one red and one green) 
are rolled. Define the events 


A: a 1 or a 2 shows on the red die
B: a 3, 4, or 5 shows on the green die
C: the dice total is 4, 11, or 12


Show that these events satisfy Equation 2.5.3 but not 
Equation 2.5.4. 


2.5.12. A roulette wheel has thirty-six numbers colored 
red or black according to the pattern indicated below: 


Roulette wheel pattern 

5 6 7 8 9 10 11 12 13 14 15 16 17 18 
RBBBBRR 

36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 


Define the events 


A: red number appears 
B: even number appears 
C: number is less than or equal to 18 


Show that these events satisfy Equation 2.5.4 but not 
Equation 2.5.3. 


2.5.13. How many probability equations need to be veri- 
fied to establish the mutual independence of four events? 


2.5.14. In a roll of a pair of fair dice (one red and one 
green), let A be the event the red die shows a 3, 4, or 5; let 
B be the event the green die shows a 1 or a 2; and let C 
be the event the dice total is 7. Show that A, B, and C are 
independent. 


2.5.15. In a roll of a pair of fair dice (one red and one 
green), let A be the event of an odd number on the red 
die, let B be the event of an odd number on the green die, 
and let C be the event that the sum is odd. Show that any 
pair of these events is independent but that A, B, and C 
are not mutually independent. 


2.5.16. On her way to work, a commuter encounters four 
traffic signals. Assume that the distance between each of 
the four is sufficiently great that her probability of getting 
a green light at any intersection is independent of what 
happened at any previous intersection. The first two lights 
are green for forty seconds of each minute; the last two, 
for thirty seconds of each minute. What is the probability 
that the commuter has to stop at least three times? 


2.5.17. School board officials are debating whether to 
require all high school seniors to take a proficiency exam 
before graduating. A student passing all three parts (math- 
ematics, language skills, and general knowledge) would be 
awarded a diploma; otherwise, he or she would receive 
only a certificate of attendance. A practice test given 
to this year’s ninety-five hundred seniors resulted in the 
following numbers of failures: 


Subject Area Number of Students Failing 
Mathematics 3325 
Language skills 1900 
General knowledge 1425 


If “Student fails mathematics,” “Student fails language 
skills,” and “Student fails general knowledge” are inde- 
pendent events, what proportion of next year’s seniors can 


be expected to fail to qualify for a diploma? Does inde- 
pendence seem a reasonable assumption in this situation? 


2.5.18. Consider the following four-switch circuit:

[Circuit diagram not reproduced]

If all switches operate independently and P(Switch closes) =
p, what is the probability the circuit is completed?


2.5.19. A fast-food chain is running a new promotion. For 
each purchase, a customer is given a game card that may 
win $10. The company claims that the probability of a per- 
son winning at least once in five tries is 0.32. What is the 
probability that a customer wins $10 on his or her first 
purchase? 


2.5.20. Players A, B, and C toss a fair coin in order. 
The first to throw a head wins. What are their respective 
chances of winning? 


2.5.21. In a certain third world nation, statistics show 
that only two out of ten children born in the early 1980s 
reached the age of twenty-one. If the same mortality rate 
is operative over the next generation, how many children 
does a woman need to bear if she wants to have at least a 
75% probability that at least one of her offspring survives 
to adulthood? 


2.5.22. According to an advertising study, 15% of tele- 
vision viewers who have seen a certain automobile 
commercial can correctly identify the actor who does the 
voice-over. Suppose that ten such people are watching TV 
and the commercial comes on. What is the probability that 
at least one of them will be able to name the actor? What 
is the probability that exactly one will be able to name the 
actor? 


2.5.23. A fair die is rolled and then n fair coins are tossed, 
where n is the number showing on the die. What is the 
probability that no heads appear? 


2.5.24. Each of m urns contains three red chips and four
white chips. A total of r samples with replacement are
taken from each urn. What is the probability that at least
one red chip is drawn from at least one urn?


2.5.25. If two fair dice are tossed, what is the smallest 
number of throws, n, for which the probability of getting 
at least one double 6 exceeds 0.5? (Note: This was one of 
the first problems that de Méré communicated to Pascal 
in 1654.) 


2.5.26. A pair of fair dice are rolled until the first sum of 
8 appears. What is the probability that a sum of 7 does not 
precede that first sum of 8? 


2.5.27. An urn contains w white chips, b black chips, and 
r red chips. The chips are drawn out at random, one at 
a time, with replacement. What is the probability that a 
white appears before a red? 


2.5.28. A Coast Guard dispatcher receives an SOS from 
a ship that has run aground off the shore of a small island. 
Before the captain can relay her exact position, though, 
her radio goes dead. The dispatcher has n helicopter crews 
he can send out to conduct a search. He suspects the ship 
is somewhere either south in area I (with probability p)
or north in area II (with probability 1 - p). Each of the
n rescue parties is equally competent and has probability
r of locating the ship given it has run aground in the
sector being searched. How should the dispatcher deploy
the helicopter crews to maximize the probability that one
of them will find the missing ship? (Hint: Assume that m
search crews are sent to area I and n - m are sent to area
II. Let B denote the event that the ship is found, let $A_1$ be
the event that the ship is in area I, and let $A_2$ be the event
that the ship is in area II. Use Theorem 2.4.1 to get an
expression for P(B); then differentiate with respect to m.)


2.5.29. A computer is instructed to generate a random 
sequence using the digits 0 through 9; repetitions are per- 
missible. What is the shortest length the sequence can be 
and still have at least a 70% probability of containing at 
least one 4? 


2.5.30. A box contains a two-headed coin and eight fair 
coins. One coin is drawn at random and tossed n times. 
Suppose all n tosses come up heads. Show that the limit of 
the probability that the coin is fair is 0 as n goes to infinity. 


2.6 Combinatorics 


Combinatorics is a time-honored branch of mathematics concerned with count- 
ing, arranging, and ordering. While blessed with a wealth of early contributors 
(there are references to combinatorial problems in the Old Testament), its emer- 
gence as a separate discipline is often credited to the German mathematician 
and philosopher Gottfried Wilhelm Leibniz (1646-1716), whose 1666 treatise, Dissertatio 
de arte combinatoria, was perhaps the first monograph written on the 
subject (107). 

Applications of combinatorics are rich in both diversity and number. Users 
range from the molecular biologist trying to determine how many ways genes can be 
positioned along a chromosome, to a computer scientist studying queuing priorities, 
to a psychologist modeling the way we learn, to a weekend poker player wonder- 
ing whether he should draw to a straight, or a flush, or a full house. Surprisingly 
enough, despite the considerable differences that seem to distinguish one question 
from another, solutions to all of these questions are rooted in the same set of four 
basic theorems and rules. 


Counting Ordered Sequences: The Multiplication Rule 


More often than not, the relevant “outcomes” in a combinatorial problem are 
ordered sequences. If two dice are rolled, for example, the outcome (4, 5)—that is, 
the first die comes up 4 and the second die comes up 5—is an ordered sequence 
of length two. The number of such sequences is calculated by using the most 
fundamental result in combinatorics, the multiplication rule. 


Multiplication Rule If operation A can be performed in m different ways and operation 
B in n different ways, the sequence (operation A, operation B) can be performed 
in $m \cdot n$ different ways. 


Proof At the risk of belaboring the obvious, we can verify the multiplication rule by 
considering a tree diagram (see Figure 2.6.1). Since each version of A can be followed 
by any of n versions of B, and there are m of the former, the total number of “A, B” 
sequences that can be pieced together is obviously the product $m \cdot n$. 


Figure 2.6.1 (tree diagram: each of the m versions of operation A branches into versions 1, 2, ..., n of operation B) 

Corollary If operation $A_i$, $i = 1, 2, \ldots, k$, can be performed in $n_i$ ways, $i = 1, 2, \ldots, k$, respectively, 
then the ordered sequence (operation $A_1$, operation $A_2$, ..., operation $A_k$) can 
be performed in $n_1 \cdot n_2 \cdots n_k$ ways. 


Example 2.6.1 
The combination lock on a briefcase has two dials, each marked off with sixteen 
notches (see Figure 2.6.2). To open the case, a person first turns the left dial in a 
certain direction for two revolutions and then stops on a particular mark. The right 
dial is set in a similar fashion, after having been turned in a certain direction for two 
revolutions. How many different settings are possible? 


Figure 2.6.2 


In the terminology of the multiplication rule, opening the briefcase corresponds 
to the four-step sequence ($A_1$, $A_2$, $A_3$, $A_4$) detailed in Table 2.6.1. Applying the 
previous corollary, we see that 1024 different settings are possible: 

number of different settings $= n_1 \cdot n_2 \cdot n_3 \cdot n_4 = 2 \cdot 16 \cdot 2 \cdot 16 = 1024$ 

Table 2.6.1 
Operation   Purpose                                               Number of Options 
$A_1$       Rotating the left dial in a particular direction      2 
$A_2$       Choosing an endpoint for the left dial                16 
$A_3$       Rotating the right dial in a particular direction     2 
$A_4$       Choosing an endpoint for the right dial               16 


Comment Designers of locks should be aware that the number of dials, as opposed 
to the number of notches on each dial, is the critical factor in determining how 
many different settings are possible. A two-dial lock, for example, where each dial 
has twenty notches, gives rise to only $2 \cdot 20 \cdot 2 \cdot 20 = 1600$ settings. If those forty 
notches, though, are distributed among four dials (ten to each dial), the number 
of different settings increases a hundredfold to 160,000 
($= 2 \cdot 10 \cdot 2 \cdot 10 \cdot 2 \cdot 10 \cdot 2 \cdot 10$). 
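
The lock counts above can be checked numerically. The following is a minimal sketch, not part of the original text: the function names `count_settings` and `enumerate_settings` are illustrative, and each option is represented by an arbitrary integer label. It multiplies the options per step and, as a cross-check, lists every ordered sequence explicitly.

```python
from itertools import product
from math import prod


def count_settings(options_per_step):
    """Multiplication rule: multiply the number of options at each step."""
    return prod(options_per_step)


def enumerate_settings(options_per_step):
    """Brute-force check: build every ordered sequence of choices."""
    return list(product(*(range(n) for n in options_per_step)))


# Two dials, each needing a direction (2 options) and an endpoint (16 options).
two_dials = [2, 16, 2, 16]
print(count_settings(two_dials), len(enumerate_settings(two_dials)))

# The comparison made in the Comment: 2 dials x 20 notches vs. 4 dials x 10 notches.
print(count_settings([2, 20] * 2), count_settings([2, 10] * 4))
```

Enumeration is only practical for small cases, but agreement between the two functions is a quick sanity check on the formula.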


Example 2.6.2 
Alphonse Bertillon, a nineteenth-century French criminologist, developed an 
identification system based on eleven anatomical variables (height, head width, ear length, 
etc.) that presumably remain essentially unchanged during an individual’s adult life. 
The range of each variable was divided into three subintervals: small, medium, and 
large. A person’s Bertillon configuration is an ordered sequence of eleven letters, say, 


s,s,m,m,l,s,l,s,s,m,s 


where a letter indicates the individual’s “size” relative to a particular variable. How 
populated does a city have to be before it can be guaranteed that at least two citizens 
will have the same Bertillon configuration? 

Viewed as an ordered sequence, a Bertillon configuration is an eleven-step 
classification system, where three options are available at each step. By the multiplication 
rule, a total of $3^{11}$, or 177,147, distinct sequences are possible. Therefore, any 
city with at least 177,148 adults would necessarily have at least two residents with the 
same pattern. (The limited number of possibilities generated by the configuration’s 
variables proved to be one of its major weaknesses. Still, it was widely used in 
Europe for criminal identification before the development of fingerprinting.) 


Example 2.6.3 
In 1824 Louis Braille invented what would eventually become the standard alphabet 
for the blind. Based on an earlier form of “night writing” used by the French 
army for reading battlefield communiqués in the dark, Braille’s system replaced 
each written character with a six-dot matrix: 


where certain dots were raised, the choice depending on the character being 
transcribed. The letter e, for example, has two raised dots and is written 


Punctuation marks, common words, suffixes, and so on, also have specified dot 
patterns. In all, how many different characters can be enciphered in Braille? 


Figure 2.6.3 (dot numbers 1 through 6, with two options for each dot, giving $2^6$ sequences) 


Think of the dots as six distinct operations, numbered 1 to 6 (see Figure 2.6.3). In 
forming a Braille letter, we have two options for each dot: We can raise it or not raise 
it. The letter e, for example, corresponds to the six-step sequence (raise, do not raise, 
do not raise, do not raise, raise, do not raise). The number of such sequences, with 
$k = 6$ and $n_1 = n_2 = \cdots = n_6 = 2$, is $2^6$, or 64. One of those sixty-four configurations, 
though, has no raised dots, making it of no use to a blind person. Figure 2.6.4 shows 
the entire sixty-three-character Braille alphabet. 
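
The Braille count can be reproduced by listing the raise/do-not-raise sequences directly. This is a small sketch under the assumption that a dot pattern is modeled as a tuple of booleans (True = raised); the name `braille_patterns` is illustrative only.

```python
from itertools import product


def braille_patterns(dots=6):
    """All raise/do-not-raise sequences with at least one raised dot."""
    all_patterns = product((True, False), repeat=dots)
    return [p for p in all_patterns if any(p)]


patterns = braille_patterns()

# The letter e as described in the text: dots 1 and 5 raised.
letter_e = (True, False, False, False, True, False)
print(len(patterns), letter_e in patterns)
```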


Example 2.6.4 
The annual NCAA (“March Madness”) basketball tournament starts with a field 
of sixty-four teams. After six rounds of play, the squad that remains unbeaten is 
declared the national champion. How many different configurations of winners and 
losers are possible, starting with the first round? Assume that the initial pairing of 
the sixty-four invited teams into thirty-two first-round matches has already been 
done. 

Counting the number of ways a tournament of this sort can play out is an 
exercise in applying the multiplication rule twice. Notice, first, that the thirty-two 
first-round games can be decided in $2^{32}$ ways. Similarly, the resulting sixteen second-round 
games can generate $2^{16}$ different winners, and so on. Overall, the tournament 
can be pictured as a six-step sequence, where the numbers of possible outcomes at 
the six steps are $2^{32}$, $2^{16}$, $2^8$, $2^4$, $2^2$, and $2^1$, respectively. It follows that the number of 
possible tournaments (not all of which, of course, would be equally likely!) is the 
product $2^{32} \cdot 2^{16} \cdot 2^8 \cdot 2^4 \cdot 2^2 \cdot 2^1$, or $2^{63}$. 
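
The same round-by-round argument can be tested on a bracket small enough to enumerate. The sketch below is illustrative rather than taken from the text: it assumes the number of teams is a power of two and that adjacent entries in the list are paired in each round; `tournament_count` and `enumerate_tournaments` are hypothetical names.

```python
from itertools import product


def tournament_count(teams):
    """Multiply 2**(games in the round) over all rounds (teams: a power of two)."""
    total, games = 1, teams // 2
    while games >= 1:
        total *= 2 ** games
        games //= 2
    return total


def enumerate_tournaments(bracket):
    """List every round-by-round set of winners for a small, already paired bracket."""
    if len(bracket) == 1:
        return [[]]
    pairs = [bracket[i:i + 2] for i in range(0, len(bracket), 2)]
    results = []
    for winners in product(*pairs):  # one winner chosen from each game
        for rest in enumerate_tournaments(list(winners)):
            results.append([winners] + rest)
    return results


print(tournament_count(4), len(enumerate_tournaments(list("ABCD"))))
print(tournament_count(64) == 2 ** 63)
```

A four-team bracket has three games in total, so both counts should equal $2^3$; the sixty-four-team case is far too large to list but follows the same product.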


Example 2.6.5 
In a famous science fiction story by Arthur C. Clarke, “The Nine Billion Names 
of God,” a computer firm is hired by the lamas in a Tibetan monastery to write a 
program to generate all possible names of God. For reasons never divulged, the 
lamas believe that all such names can be written using no more than nine letters. If 
no letter combinations are ruled inadmissible, is the “nine billion” in the story’s title 
a large enough number to accommodate all possibilities? 

No. The lamas are in for a fleecing. The total number of names, N, would be the 
sum of all one-letter names, two-letter names, and so on. By the multiplication rule, 
the number of $k$-letter names is $26^k$, so 

$N = 26^1 + 26^2 + \cdots + 26^9 = 5,646,683,826,134$ 


The proposed list of nine billion, then, would be more than 5.6 trillion names short! 
(Note: The discrepancy between the story’s title and the N we just computed is more 
a language difference than anything else. Clarke was British, and the British have 
different names for certain numbers than we have in the United States. Specifically, 
an American trillion is the British billion, which means that the American editions 
of Mr. Clarke’s story would be more properly entitled “The Nine Trillion Names of 
God.” A more puzzling question, of course, is why “nine” appears in the title as 
opposed to “six.”) 
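
The total N is a sum over name lengths, with the multiplication rule applied inside each length. A minimal sketch follows; the helper names `names_up_to` and `enumerate_names` are assumptions for illustration, and the brute-force check uses a two-letter alphabet so the list stays small.

```python
from itertools import product


def names_up_to(max_len, alphabet_size=26):
    """Sum of alphabet_size**k over name lengths k = 1, ..., max_len."""
    return sum(alphabet_size ** k for k in range(1, max_len + 1))


def enumerate_names(max_len, letters):
    """Brute-force list of all names of length 1 through max_len."""
    return ["".join(p)
            for k in range(1, max_len + 1)
            for p in product(letters, repeat=k)]


N = names_up_to(9)
print(N, N - 9_000_000_000)

# Small analogue: two letters, names of length at most three.
print(names_up_to(3, alphabet_size=2), len(enumerate_names(3, "ab")))
```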


Example 2.6.6 
Proteins are chains of molecules chosen (with repetition) from some twenty different 
amino acids. In a living cell, proteins are synthesized through the genetic code, a 
mechanism whereby ordered sequences of nucleotides in the messenger RNA dic- 
tate the formation of a particular amino acid. The four key nucleotides are adenine, 
guanine, cytosine, and uracil (A, G, C, and U). Assuming A, G, C, or U can appear 
any number of times in a nucleotide chain and that all sequences are physically pos- 
sible, what is the minimum length the nucleotides must have if they are to be able to 
encode the amino acids? 

The answer derives from a trial-and-error application of the multiplication rule. 
Given a length $r$, the number of different nucleotide sequences would be $4^r$. We are 
looking, then, for the smallest $r$ such that $4^r \geq 20$. Clearly, $r = 3$, since $4^2 = 16 < 20 \leq 64 = 4^3$. 

The entire genetic code for the amino acids is shown in Figure 2.6.5. For 
a discussion of the duplication and the significance of the three missing triplets, 
see (194). 


Alanine        GCU, GCC, GCA, GCG             Leucine        UUA, UUG, CUU, CUC, CUA, CUG 
Arginine       CGU, CGC, CGA, CGG, AGA, AGG   Lysine         AAA, AAG 
Asparagine     AAU, AAC                       Methionine     AUG 
Aspartic acid  GAU, GAC                       Phenylalanine  UUU, UUC 
Cysteine       UGU, UGC                       Proline        CCU, CCC, CCA, CCG 
Glutamic acid  GAA, GAG                       Serine         UCU, UCC, UCA, UCG, AGU, AGC 
Glutamine      CAA, CAG                       Threonine      ACU, ACC, ACA, ACG 
Glycine        GGU, GGC, GGA, GGG             Tryptophan     UGG 
Histidine      CAU, CAC                       Tyrosine       UAU, UAC 
Isoleucine     AUU, AUC, AUA                  Valine         GUU, GUC, GUA, GUG 
Figure 2.6.5 
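
The trial-and-error search for the shortest adequate nucleotide length is easy to automate. This is a hedged sketch; `min_codon_length` is an illustrative name, and the defaults (four nucleotides, twenty amino acids) are the figures quoted in the example.

```python
from itertools import product


def min_codon_length(nucleotides=4, amino_acids=20):
    """Smallest r with nucleotides**r >= amino_acids, found by trial and error."""
    r = 1
    while nucleotides ** r < amino_acids:
        r += 1
    return r


r = min_codon_length()
triplets = ["".join(t) for t in product("AGCU", repeat=r)]
print(r, len(triplets))
```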


Problem-Solving Hints 


(Doing combinatorial problems) 


Combinatorial questions sometimes call for problem-solving techniques 
that are not routinely used in other areas of mathematics. The three listed below 
are especially helpful. 


1. Draw a diagram that shows the structure of the outcomes that are being 
counted. Be sure to include (or indicate) all relevant variations. A case in 
point is Figure 2.6.3. Almost invariably, diagrams such as these will suggest 
the formula, or combination of formulas, that should be applied. 

2. Use enumerations to “test” the appropriateness of a formula. Typically, 
the answer to a combinatorial problem—that is, the number of ways to 
do something—will be so large that listing all possible outcomes is not 
feasible. It often is feasible, though, to construct a simple, but analogous, 
problem for which the entire set of outcomes can be identified (and 
counted). If the proposed formula does not agree with the simple-case enu- 
meration, we know that our analysis of the original question is incorrect. 

3. If the outcomes to be counted fall into structurally different categories, the 
total number of outcomes will be the sum (not the product) of the number 
of outcomes in each category. Recall Example 2.6.5. The categories there 
are the nine different name lengths. 


Questions 


2.6.1. A chemical engineer wishes to observe the effects 
of temperature, pressure, and catalyst concentration on 
the yield resulting from a certain reaction. If she intends 
to include two different temperatures, three pressures, 
and two levels of catalyst, how many different runs must 
she make in order to observe each temperature-pressure- 
catalyst combination exactly twice? 


2.6.2. A coded message from a CIA operative to his Rus- 
sian KGB counterpart is to be sent in the form Q4ET, 
where the first and last entries must be consonants; the 
second, an integer 1 through 9; and the third, one of the six 
vowels. How many different ciphers can be transmitted? 


2.6.3. How many terms will be included in the expansion 
of 


$(a + b + c)(d + e + f)(x + y + u + v + w)$ 


Which of the following will be included in that number: 
aeu, cdx, bef, xvw? 


2.6.4. Suppose that the format for license plates in a 
certain state is two letters followed by four numbers. 


(a) How many different plates can be made? 

(b) How many different plates are there if the letters can 
be repeated but no two numbers can be the same? 

(c) How many different plates can be made if repeti- 
tions of numbers and letters are allowed except that 
no plate can have four zeros? 


2.6.5. How many integers between 100 and 999 have 
distinct digits, and how many of those are odd numbers? 


2.6.6. A fast-food restaurant offers customers a choice 
of eight toppings that can be added to a hamburger. How 
many different hamburgers can be ordered? 


2.6.7. In baseball there are twenty-four different “base- 
out” configurations (runner on first—two outs, bases 
loaded—none out, and so on). Suppose that a new game, 
sleazeball, is played where there are seven bases (exclud- 
ing home plate) and each team gets five outs an inning. 
How many base-out configurations would be possible in 
sleazeball? 


2.6.8. When they were first introduced, postal zip codes 
were five-digit numbers, theoretically ranging from 00000 
to 99999. (In reality, the lowest zip code was 00601 for San 
Juan, Puerto Rico; the highest was 99950 for Ketchikan, 
Alaska.) An additional four digits have been added, so 
each zip code is now a nine-digit number. How many 
zip codes are at least as large as 60000-0000, are even 
numbers, and have a 7 as their third digit? 


2.6.9. A restaurant offers a choice of four appetizers, 
fourteen entrees, six desserts, and five beverages. How 
many different meals are possible if a diner intends to 
order only three courses? (Consider the beverage to be a 
“course.”) 


2.6.10. An octave contains twelve distinct notes (on a 
piano, five black keys and seven white keys). How many 
different eight-note melodies within a single octave can be 
written if the black keys and white keys need to alternate? 


2.6.11. Residents of a condominium have an automatic 
garage door opener that has a row of eight buttons. Each 
garage door has been programmed to respond to a par- 
ticular set of buttons being pushed. If the condominium 
houses 250 families, can residents be assured that no two 
garage doors will open on the same signal? If so, how 
many additional families can be added before the eight- 
button code becomes inadequate? (Note: The order in 
which the buttons are pushed is irrelevant.) 


2.6.12. In international Morse code, each letter in the 
alphabet is symbolized by a series of dots and dashes: the 
letter a, for example, is encoded as “· —”. What is the minimum 
number of dots and/or dashes needed to represent 
any letter in the English alphabet? 


2.6.13. The decimal number corresponding to a sequence 
of n binary digits $a_0, a_1, \ldots, a_{n-1}$, where each $a_i$ is either 0 
or 1, is defined to be 

$a_0 2^0 + a_1 2^1 + \cdots + a_{n-1} 2^{n-1}$ 

For example, the sequence 0 1 1 0 is equal to 6 
($= 0 \cdot 2^0 + 1 \cdot 2^1 + 1 \cdot 2^2 + 0 \cdot 2^3$). Suppose a fair coin is tossed 
nine times. Replace the resulting sequence of H’s and 
T’s with a binary sequence of 1’s and 0’s (1 for H, 0 
for T). For how many sequences of tosses will the decimal 
corresponding to the observed set of heads and tails 
exceed 256? 


2.6.14. Given the letters in the word 
ZOMBIES 


in how many ways can two of the letters be arranged such 
that one is a vowel and one is a consonant? 


2.6.15. Suppose that two cards are drawn—in order— 
from a standard 52-card poker deck. In how many ways 
can the first card be a club and the second card be an ace? 


2.6.16. Monica’s vacation plans require that she fly from 
Nashville to Chicago to Seattle to Anchorage. According 
to her travel agent, there are three available flights from 
Nashville to Chicago, five from Chicago to Seattle, and 
two from Seattle to Anchorage. Assume that the numbers 
of options she has for return flights are the same. How 
many round-trip itineraries can she schedule? 


Counting Permutations (when the objects are all distinct) 


Ordered sequences arise in two fundamentally different ways. The first is the scenario 
addressed by the multiplication rule: a process consists of $k$ operations, 
each allowing $n_i$ options, $i = 1, 2, \ldots, k$; choosing one version of each operation leads 
to $n_1 n_2 \cdots n_k$ possibilities. 

The second occurs when an ordered arrangement of some specified length k is 
formed from a finite collection of objects. Any such arrangement is referred to as a 
permutation of length k. For example, given the three objects A, B, and C, there are 
six different permutations of length two that can be formed if the objects cannot be 
repeated: AB, AC, BC, BA, CA, and CB. 


Theorem 2.6.1 The number of permutations of length $k$ that can be formed from a set of $n$ distinct 
elements, repetitions not allowed, is denoted by the symbol $_nP_k$, where 

$_nP_k = n(n-1)(n-2)\cdots(n-k+1) = \frac{n!}{(n-k)!}$ 

Proof Any of the $n$ objects may occupy the first position in the arrangement, any 
of $n - 1$ the second, and so on; the number of choices available for filling the $k$th 
position will be $n - k + 1$ (see Figure 2.6.6). The theorem follows, then, from the 
multiplication rule: There will be $n(n-1)\cdots(n-k+1)$ ordered arrangements. 

Figure 2.6.6 (choices $n$, $n-1$, ..., $n-(k-2)$, $n-(k-1)$ for positions $1, 2, \ldots, k-1, k$ in the sequence) 

Corollary 2.6.2 The number of ways to permute an entire set of $n$ distinct objects is 
$_nP_n = n(n-1)(n-2)\cdots 1 = n!$. 

Example 2.6.7 
How many permutations of length $k = 3$ can be formed from the set of $n = 4$ distinct 
elements, A, B, C, and D? 

According to Theorem 2.6.1, the number should be 24: 

$\frac{n!}{(n-k)!} = \frac{4!}{(4-3)!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{1} = 24$ 

Confirming that figure, Table 2.6.2 lists the entire set of 24 permutations and 
illustrates the argument used in the proof of the theorem. 


Table 2.6.2 
 1. (ABC)    7. (BAC)   13. (CAB)   19. (DAB) 
 2. (ABD)    8. (BAD)   14. (CAD)   20. (DAC) 
 3. (ACB)    9. (BCA)   15. (CBA)   21. (DBA) 
 4. (ACD)   10. (BCD)   16. (CBD)   22. (DBC) 
 5. (ADB)   11. (BDA)   17. (CDA)   23. (DCA) 
 6. (ADC)   12. (BDC)   18. (CDB)   24. (DCB) 
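
The listing in Table 2.6.2 can be regenerated mechanically. The sketch below is an illustration, not the authors' procedure: it uses Python's standard `itertools.permutations` and `math.perm`, and the helper name `n_P_k` is hypothetical.

```python
from itertools import permutations
from math import factorial, perm


def n_P_k(n, k):
    """Number of length-k permutations of n distinct objects: n! / (n - k)!."""
    return factorial(n) // factorial(n - k)


# All ordered arrangements of length 3 drawn from A, B, C, D without repetition.
arrangements = ["".join(p) for p in permutations("ABCD", 3)]

print(len(arrangements), n_P_k(4, 3), perm(4, 3))
print(arrangements[:6])
```

Because the input letters are already sorted, the generated arrangements appear in the same first-choice, second-choice, third-choice order as the tree argument in the proof.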


Example 2.6.8 
In her sonnet with the famous first line, “How do I love thee? Let me count the 
ways,” Elizabeth Barrett Browning listed eight ways. Suppose Ms. Browning had 
decided that writing greeting cards afforded her a better format for expressing her 
feelings. For how many years could she have corresponded with her favorite beau 
on a daily basis and never sent the same card twice? Assume that each card contains 
exactly four of the eight “ways” and that order matters. 

In selecting the verse for a card, Ms. Browning would be creating a permutation 
of length k = 4 from a set of n = 8 distinct objects. According to Theorem 2.6.1, 


number of different cards $= {}_8P_4 = \frac{8!}{(8-4)!} = 8 \cdot 7 \cdot 6 \cdot 5 = 1680$ 

At the rate of a card a day, she could have kept the correspondence going for more 
than four and one-half years. 


Example 2.6.9 
Years ago—long before Rubik’s Cubes and electronic games had become 
epidemic—puzzles were much simpler. One of the more popular combinatorial- 
related diversions was a four-by-four grid consisting of fifteen movable squares 
and one empty space. The object was to maneuver as quickly as possible an arbi- 
trary configuration (Figure 2.6.7a) into a specific pattern (Figure 2.6.7b). How many 
different ways could the puzzle be arranged? 
Take the empty space to be square number 16 and imagine the four rows of 
the grid laid end to end to make a sixteen-digit sequence. Each permutation of that 
sequence corresponds to a different pattern for the grid. By the corollary to The- 
orem 2.6.1, the number of ways to position the tiles is 16!, or more than twenty 
trillion (20,922,789,888,000, to be exact). That total is more than fifty times the num- 
ber of stars in the entire Milky Way galaxy. (Note: Not all of the 16! permutations 
can be generated without physically removing some of the tiles. Think of the two- 
by-two version of Figure 2.6.7 with tiles numbered 1 through 3. How many of the 4! 
theoretical configurations can actually be formed?) 


Figure 2.6.7 


Example 2.6.10 
A deck of fifty-two cards is shuffled and dealt face up in a row. For how many 
arrangements will the four aces be adjacent? 

This is a good example illustrating the problem-solving benefits that come from 
drawing diagrams, as mentioned earlier. Figure 2.6.8 shows the basic structure that 
needs to be considered: The four aces are positioned as a “clump” somewhere 
between or around the forty-eight non-aces. 


Non-aces 


4 aces 


Figure 2.6.8 


Clearly, there are forty-nine “spaces” that could be occupied by the four aces (in 
front of the first non-ace, between the first and second non-aces, and so on). Furthermore, 
by the corollary to Theorem 2.6.1, once the four aces are assigned to one of 
those forty-nine positions, they can still be permuted in $_4P_4 = 4!$ ways. Similarly, the 
forty-eight non-aces can be arranged in $_{48}P_{48} = 48!$ ways. It follows from the multiplication 
rule, then, that the number of arrangements having consecutive aces is the 
product $49 \cdot 4! \cdot 48!$, or, approximately, $1.46 \times 10^{64}$. 
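
Following the earlier hint about testing a formula on a small analogue, the "clump" argument can be checked on a six-card deck with three marked cards. This is a sketch under stated assumptions: cards are labeled by integers, the marked cards are the lowest labels, and the function names are illustrative.

```python
from itertools import permutations
from math import factorial


def adjacent_count(n_cards, n_special):
    """(spaces for the block) * (orders within block) * (orders of the other cards)."""
    others = n_cards - n_special
    return (others + 1) * factorial(n_special) * factorial(others)


def adjacent_brute_force(n_cards, n_special):
    """Count arrangements in which cards 0..n_special-1 occupy consecutive positions."""
    count = 0
    for deck in permutations(range(n_cards)):
        positions = sorted(deck.index(card) for card in range(n_special))
        if positions[-1] - positions[0] == n_special - 1:
            count += 1
    return count


print(adjacent_count(6, 3), adjacent_brute_force(6, 3))
print(adjacent_count(52, 4) == 49 * factorial(4) * factorial(48))
```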


Comment Computing n! can be quite cumbersome, even for n’s that are fairly 
small: We saw in Example 2.6.9, for instance, that 16! is already in the trillions. For- 
tunately, an easy-to-use approximation is available. According to Stirling's formula, 


$n! \approx \sqrt{2\pi}\, n^{n+1/2} e^{-n}$ 

In practice, we apply Stirling’s formula by writing 

$\log_{10}(n!) \approx \log_{10}(\sqrt{2\pi}) + \left(n + \frac{1}{2}\right)\log_{10}(n) - n\log_{10}(e)$ 

and then exponentiating the right-hand side. 
In Example 2.6.10, the number of arrangements was calculated to be $49 \cdot 4! \cdot 48!$, 
or $24 \cdot 49!$. Substituting into Stirling’s formula, we can write 

$\log_{10}(49!) \approx \log_{10}(\sqrt{2\pi}) + \left(49 + \frac{1}{2}\right)\log_{10}(49) - 49\log_{10}(e) \approx 62.783366$ 

Therefore, 

$24 \cdot 49! \approx 24 \cdot 10^{62.78337} = 1.46 \times 10^{64}$ 

Example 2.6.11 
In chess a rook can move vertically and horizontally (see Figure 2.6.9). It can capture 
any unobstructed piece located anywhere in its own row or column. In how many 
ways can eight distinct rooks be placed on a chessboard (having eight rows and eight 
columns) so that no two can capture one another? 


Figure 2.6.9 


To start with a simpler problem, suppose that the eight rooks are all identical. 
Since no two rooks can be in the same row or same column (why?), it follows that 
each row must contain exactly one. The rook in the first row, however, can be in any 
of eight columns; the rook in the second row is then limited to being in one of seven 
columns, and so on. By the multiplication rule, then, the number of noncapturing 
configurations for eight identical rooks is $_8P_8$, or 8! (see Figure 2.6.10). 

Now imagine the eight rooks to be distinct—they might be numbered, for exam- 
ple, 1 through 8. The rook in the first row could be marked with any of eight 
numbers; the rook in the second row with any of the remaining seven numbers; and 
so on. Altogether, there would be 8! numbering patterns for each configuration. The 
total number of ways to position eight distinct, noncapturing rooks, then, is 8! - 8!, or 
1,625,702,400. 

Figure 2.6.10 (choices available in rows 1 through 8: 8, 7, 6, 5, 4, 3, 2, 1; total number $= 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1$) 
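
Both rook counts can be verified by brute force on a smaller board. This is a minimal sketch, assuming squares are (row, column) pairs and that a tuple of squares produced by `permutations` assigns rook 1 to the first square, rook 2 to the second, and so on; the name `noncapturing` is illustrative.

```python
from itertools import combinations, permutations
from math import factorial


def noncapturing(n, distinct=False):
    """Count placements of n rooks on an n-by-n board with no shared row or column."""
    squares = [(r, c) for r in range(n) for c in range(n)]
    # Unordered sets of squares for identical rooks, ordered tuples for distinct ones.
    chooser = permutations if distinct else combinations
    count = 0
    for placement in chooser(squares, n):
        rows = {r for r, _ in placement}
        cols = {c for _, c in placement}
        if len(rows) == n and len(cols) == n:
            count += 1
    return count


print(noncapturing(4), factorial(4))
print(noncapturing(3, distinct=True), factorial(3) ** 2)
print(factorial(8) ** 2)
```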

Example 2.6.12 
A new horror movie, Friday the 13th, Part X, will star Jason’s great-grandson (also 
named Jason) as a psychotic trying to dispatch (as gruesomely as possible) eight 


camp counselors, four men and four women. (a) How many scenarios (i.e., victim 
orders) can the screenwriters devise, assuming they want Jason to do away with all 
the men before going after any of the women? (b) How many scripts are possible if 
the only restriction imposed on Jason is that he save Muffy for last? 


a. Suppose the male counselors are denoted A, B, C, and D and the female coun- 
selors, W, X, Y, and Z. Among the admissible plots would be the sequence 
pictured in Figure 2.6.11, where B is done in first, then D, and so on. The men, 
if they are to be restricted to the first four positions, can still be permuted in 
$_4P_4 = 4!$ ways. The same number of arrangements can be found for the women. 
Furthermore, the plot in its entirety can be thought of as a two-step sequence: 
first the men are eliminated, then the women. Since 4! ways are available to 
do the former and 4! the latter, the total number of different scripts, by the 
multiplication rule, is 4!4!, or 576. 


                  Men          Women 
Victim:           B  D  A  C   Y  Z  W  X 
Order of killing: 1  2  3  4   5  6  7  8 


Figure 2.6.11 


b. If the only condition to be met is that Muffy be dealt with last, the number of 
admissible scripts is simply $_7P_7 = 7!$, that being the number of ways to permute 
the other seven counselors (see Figure 2.6.12). 


Victim:           B  W  Z  C  Y  A  D  Muffy 
Order of killing: 1  2  3  4  5  6  7  8 

Figure 2.6.12 

Example 2.6.13 
Consider the set of nine-digit numbers that can be formed by rearranging without 
repetition the integers 1 through 9. For how many of those permutations will the 1 
and the 2 precede the 3 and the 4? That is, we want to count sequences like 7 2 5 1 3 
6 9 4 8 but not like 6 8 1 5 4 2 7 3 9. 

At first glance, this seems to be a problem well beyond the scope of Theorem 2.6.1. 
With the help of a symmetry argument, though, its solution is surprisingly 
simple. 


Think of just the digits 1 through 4. By the corollary to Theorem 2.6.1, those four numbers 
give rise to 4! (= 24) permutations. Of those twenty-four, only four, namely (1, 2, 3, 4), (2, 1, 
3, 4), (1, 2, 4, 3), and (2, 1, 4, 3), have the property that the 1 and the 2 come before 
the 3 and the 4. Each of the twenty-four relative orderings of 1, 2, 3, and 4 occurs equally 
often among the nine-digit permutations, so it follows that 4/24 of the total number of 
nine-digit permutations should satisfy the condition being imposed on 1, 2, 3, and 4. Therefore, 

number of permutations where 1 and 2 precede 3 and 4 $= \frac{4}{24} \cdot 9!$ = 60,480 
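
The symmetry argument can be confirmed by direct enumeration, since 9! is only 362,880. The following sketch is illustrative; the function name `count_1_2_before_3_4` is hypothetical, and the parameter n (at least 4) lets the same check run on shorter sequences first.

```python
from itertools import permutations
from math import factorial


def count_1_2_before_3_4(n=9):
    """Count permutations of 1..n in which both 1 and 2 appear before both 3 and 4."""
    count = 0
    for seq in permutations(range(1, n + 1)):
        later_of_1_2 = max(seq.index(1), seq.index(2))
        earlier_of_3_4 = min(seq.index(3), seq.index(4))
        if later_of_1_2 < earlier_of_3_4:
            count += 1
    return count


# Symmetry prediction: 4/24 of all n! permutations.
print(count_1_2_before_3_4(6), 4 * factorial(6) // 24)
```

Calling `count_1_2_before_3_4(9)` runs the full nine-digit enumeration and takes noticeably longer than the six-digit case.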


Questions 


2.6.17. The board of a large corporation has six mem- 
bers willing to be nominated for office. How many dif- 
ferent “president/vice president/treasurer” slates could be 
submitted to the stockholders? 


2.6.18. How many ways can a set of four tires be put on 
a car if all the tires are interchangeable? How many ways 
are possible if two of the four are snow tires? 


2.6.19. Use Stirling’s formula to approximate 30!. 
(Note: The exact answer is 265,252,859,812,191,058,636,308, 
480,000,000.) 


2.6.20. The nine members of the music faculty base- 
ball team, the Mahler Maulers, are all incompetent, and 
each can play any position equally poorly. In how many 
different ways can the Maulers take the field? 


2.6.21. A three-digit number is to be formed from the dig- 
its 1 through 7, with no digit being used more than once. 
How many such numbers would be less than 289? 


2.6.22. Four men and four women are to be seated in a 
row of chairs numbered 1 through 8. 


(a) How many total arrangements are possible? 
(b) How many arrangements are possible if the men are 
required to sit in alternate chairs? 


2.6.23. An engineer needs to take three technical elec- 
tives sometime during his final four semesters. The three 
are to be selected from a list of ten. In how many ways can 
he schedule those classes, assuming that he never wants to 
take more than one technical elective in any given term? 


2.6.24. How many ways can a twelve-member cheerleading 
squad (six men and six women) pair up to form 
six male-female teams? How many ways can six male-female 
teams be positioned along a sideline? What might 
the number $6!\,6!\,2^6$ represent? What might the number 
$6!\,6!\,2^6\,2^{12}$ represent? 


2.6.25. Suppose that a seemingly interminable German 
opera is recorded on all six sides of a three-record album. 
In how many ways can the six sides be played so that at 
least one is out of order? 


2.6.26. A group of n families, each with m members, are 
to be lined up for a photograph. In how many ways can the 
nm people be arranged if members of a family must stay 
together? 


2.6.27. Suppose that ten people, including you and a 
friend, line up for a group picture. How many ways 
can the photographer rearrange the line if she wants 
to keep exactly three people between you and your 
friend? 


2.6.28. Use an induction argument to prove Theo- 
rem 2.6.1. (Note: This was the first mathematical result 
known to have been proved by induction. It was done in 
1321 by Levi ben Gerson.) 


2.6.29. In how many ways can a pack of fifty-two cards be 
dealt to thirteen players, four to each, so that every player 
has one card of each suit? 


2.6.30. If the definition of n! is to hold for all nonnegative 
integers n, show that it follows that 0! must equal 1. 


2.6.31. The crew of Apollo 17 consisted of a pilot, a 
copilot, and a geologist. Suppose that NASA had actu- 
ally trained nine aviators and four geologists as candidates 
for the flight. How many different crews could they have 
assembled? 
2.6.32. Uncle Harry and Aunt Minnie will both be attending 
your next family reunion. Unfortunately, they hate 
each other. Unless they are seated with at least two people 
between them, they are likely to get into a shouting match. 
The side of the table at which they will be seated has seven 
chairs. How many seating arrangements are available for 
those seven people if a safe distance is to be maintained 
between your aunt and your uncle? 


2.6.33. In how many ways can the digits 1 through 9 be 
arranged such that 


(a) all the even digits precede all the odd digits? 

(b) all the even digits are adjacent to each other? 

(c) two even digits begin the sequence and two even 
digits end the sequence? 

(d) the even digits appear in either ascending or 
descending order? 


Counting Permutations (when the objects are not all distinct) 


The corollary to Theorem 2.6.1 gives a formula for the number of ways an entire set 
of n objects can be permuted if the objects are all distinct. Fewer than n! permutations 
are possible, though, if some of the objects are identical. For example, there are 3! =6 
ways to permute the three distinct objects A, B, and C: 


ABC 
ACB 
BAC 
BCA 
CAB 
CBA 


If the three objects to permute, though, are A, A, and B—that is, if two of the three 
are identical—the number of permutations decreases to three: 


AAB 
ABA 
BAA 


As we will see, there are many real-world applications where the n objects to be 
permuted belong to r different categories, each category containing one or more 
identical objects. 


Theorem 2.6.2 The number of ways to arrange $n$ objects, $n_1$ being of one kind, $n_2$ of a second 
kind, ..., and $n_r$ of an $r$th kind, is 

$\frac{n!}{n_1!\, n_2! \cdots n_r!}$ 

where $\sum_{i=1}^{r} n_i = n$. 

Proof Let N denote the total number of such arrangements. For any one of those N,
the similar objects (if they were actually different) could be arranged in
$n_1!\, n_2! \cdots n_r!$ ways. (Why?) It follows that $N \cdot n_1!\, n_2! \cdots n_r!$
is the total number of ways to arrange n (distinct) objects. But n! equals that same
number. Setting $N \cdot n_1!\, n_2! \cdots n_r!$ equal to n! gives the result.


Comment Ratios like $n!/(n_1!\, n_2! \cdots n_r!)$ are called multinomial coefficients because
the general term in the expansion of

\[ (x_1 + x_2 + \cdots + x_r)^n \]

is

\[ \frac{n!}{n_1!\, n_2! \cdots n_r!}\, x_1^{n_1} x_2^{n_2} \cdots x_r^{n_r} \]

Example 2.6.14

A pastry in a vending machine costs 85¢. In how many ways can a customer put in
two quarters, three dimes, and one nickel?


[Figure 2.6.13: a sample deposit sequence of six coins; horizontal axis: order in
which coins are deposited, 1 through 6]


If all coins of a given value are considered identical, then a typical deposit 
sequence, say, QDDQND (see Figure 2.6.13), can be thought of as a permutation 
of n = 6 objects belonging to r = 3 categories, where 


    $n_1$ = number of nickels = 1
    $n_2$ = number of dimes = 3
    $n_3$ = number of quarters = 2

By Theorem 2.6.2, there are sixty such sequences:

\[ \frac{n!}{n_1!\, n_2!\, n_3!} = \frac{6!}{1!\, 3!\, 2!} = 60 \]

Of course, had we assumed the coins were distinct (having been minted at different
places and different times), the number of distinct permutations would have been
6!, or 720.
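The count in the vending-machine example can be checked two ways. The sketch below is illustrative only: the helper name `multinomial` and the string encoding of the coins are assumptions, not part of the text. It evaluates the formula of Theorem 2.6.2 and, separately, enumerates every ordering of the six coins and keeps the distinct ones.

```python
from itertools import permutations
from math import factorial


def multinomial(*counts):
    """Return n!/(n_1! n_2! ... n_r!) where n is the sum of the counts."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result


# Brute force: all 6! orderings of the coins, collapsed to distinct sequences.
coins = "QQDDDN"  # two quarters, three dimes, one nickel
distinct_sequences = set(permutations(coins))

print(multinomial(1, 3, 2))     # value from the formula
print(len(distinct_sequences))  # value from direct enumeration
```

The two printed numbers should agree. The enumeration is only practical for small n, which is why the closed-form count matters.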


Example 2.6.15

Prior to the seventeenth century there were no scientific journals, a state of affairs
that made it difficult for researchers to document discoveries. If a scientist sent a 
copy of his work to a colleague, there was always a risk that the colleague might 
claim it as his own. The obvious alternative — wait to get enough material to publish a 
book—invariably resulted in lengthy delays. So, as a sort of interim documentation, 
scientists would sometimes send each other anagrams—letter puzzles that, when 
properly unscrambled, summarized in a sentence or two what had been discovered. 

When Christiaan Huygens (1629-1695) looked through his telescope and saw 
the ring around Saturn, he composed the following anagram (191): 


aaaaaaa, ccccc, d, eeeee, g, h, iiiiiii, llll, mm,
nnnnnnnnn, oooo, pp, q, rr, s, ttttt, uuuuu


How many ways can the sixty-two letters in Huygens’s anagram be arranged?

Let $n_1 (= 7)$ denote the number of a’s, $n_2 (= 5)$ the number of c’s, and so on.
Substituting into the appropriate multinomial coefficient, we find

\[ N = \frac{62!}{7!\,5!\,1!\,5!\,1!\,1!\,7!\,4!\,2!\,9!\,4!\,2!\,1!\,2!\,1!\,5!\,5!} \]

as the total number of arrangements. To get a feeling for the magnitude of N, we
need to apply Stirling’s formula to the numerator. Since

\[ 62! \approx \sqrt{2\pi}\, e^{-62}\, 62^{62.5} \]

then, taking base-10 logarithms,

\[ \log(62!) \approx \log(\sqrt{2\pi}) - 62 \cdot \log(e) + 62.5 \cdot \log(62) = 85.49731 \]

The antilog of 85.49731 is $3.143 \times 10^{85}$, so

\[ N \approx \frac{3.143 \times 10^{85}}{7!\,5!\,1!\,5!\,1!\,1!\,7!\,4!\,2!\,9!\,4!\,2!\,1!\,2!\,1!\,5!\,5!} \]

is a number on the order of $3.6 \times 10^{60}$. Huygens was clearly taking no chances!
(Note: When appropriately rearranged, the anagram becomes “Annulo cingitur
tenui, plano, nusquam cohaerente, ad eclipticam inclinato,” which translates to
“Surrounded by a thin ring, flat, suspended nowhere, inclined to the ecliptic.”)
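With arbitrary-precision integers the multinomial coefficient for the anagram can be computed exactly and set beside the Stirling estimate. The following is a sketch under one assumption: the letter counts are taken from the Latin sentence quoted in the note, with punctuation and spaces ignored. The variable names are illustrative.

```python
from collections import Counter
from math import e, factorial, log10, pi

solution = ("Annulo cingitur tenui, plano, nusquam cohaerente, "
            "ad eclipticam inclinato")
letter_counts = Counter(ch for ch in solution.lower() if ch.isalpha())
n = sum(letter_counts.values())  # total number of letters in the anagram

# Denominator of the multinomial coefficient: product of the count factorials.
denominator = 1
for count in letter_counts.values():
    denominator *= factorial(count)

exact_N = factorial(n) // denominator  # exact number of arrangements

# Stirling: log10(n!) ~ log10(sqrt(2*pi)) - n*log10(e) + (n + 1/2)*log10(n)
stirling_log10 = log10((2 * pi) ** 0.5) - n * log10(e) + (n + 0.5) * log10(n)

print(n, round(stirling_log10, 5))
print(f"{exact_N:.3e}")
```

Stirling's formula slightly underestimates 62!, so the exact value is marginally larger than the estimate, but the order of magnitude is the same.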


Example 2.6.16

What is the coefficient of $x^{23}$ in the expansion of $(1 + x^5 + x^9)^{100}$?

To understand how this question relates to permutations, consider the simpler
problem of expanding $(a + b)^2$:

\[ (a+b)^2 = (a+b)(a+b) = a \cdot a + a \cdot b + b \cdot a + b \cdot b = a^2 + 2ab + b^2 \]

Notice that each term in the first (a + b) is multiplied by each term in the second
(a + b). Moreover, the coefficient that appears in front of each term in the expansion
corresponds to the number of ways that that term can be formed. For example, the
2 in the term 2ab reflects the fact that the product ab can result from two different
multiplications: a from the first factor times b from the second, or b from the first
factor times a from the second.

By analogy, the coefficient of $x^{23}$ in the expansion of $(1 + x^5 + x^9)^{100}$ will be the
number of ways that one term from each of the one hundred factors $(1 + x^5 + x^9)$ can
be multiplied together to form $x^{23}$. The only factors that will produce $x^{23}$, though,
are the set of two $x^9$’s, one $x^5$, and ninety-seven 1’s:

\[ x^{23} = x^9 \cdot x^9 \cdot x^5 \cdot 1 \cdot 1 \cdots 1 \]

It follows that the coefficient of $x^{23}$ is the number of ways to permute two $x^9$’s, one
$x^5$, and ninety-seven 1’s. So, from Theorem 2.6.2,

\[ \text{coefficient of } x^{23} = \frac{100!}{2!\,1!\,97!} = 485{,}100 \]
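The permutation argument can be checked by expanding the power directly. The sketch below multiplies out $(1 + x^5 + x^9)^{100}$ one factor at a time, storing the polynomial as a dictionary from exponent to coefficient; the function name `expand_power` is an illustrative choice, not something from the text.

```python
from math import factorial


def expand_power(base_exponents, power):
    """Expand (sum of x**e for e in base_exponents) ** power.

    Returns a dict mapping each exponent to its integer coefficient.
    """
    poly = {0: 1}  # the constant polynomial 1
    for _ in range(power):
        nxt = {}
        for exp, coef in poly.items():
            for e in base_exponents:
                nxt[exp + e] = nxt.get(exp + e, 0) + coef
        poly = nxt
    return poly


coefficients = expand_power([0, 5, 9], 100)  # 1 = x**0, x**5, x**9
by_formula = factorial(100) // (factorial(2) * factorial(1) * factorial(97))
print(coefficients[23], by_formula)
```

Both numbers should match. The direct expansion uses no counting argument at all, so it is an independent check on the reasoning.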


Example 2.6.17

A palindrome is a phrase whose letters are in the same order whether they are read
backward or forward, such as Napoleon’s lament 


Able was I ere I saw Elba. 
or the often-cited 
Madam, I’m Adam. 
Words themselves can become the units in a palindrome, as in the sentence 


Girl, bathing on Bikini, eyeing boy, 
finds boy eyeing bikini on bathing girl. 


Suppose the members of a set consisting of four objects of one type, six of a sec- 
ond type, and two of a third type are to be lined up in a row. How many of those 
permutations are palindromes? 

Think of the twelve objects to arrange as being four A’s, six B’s, and two C’s. 
If the arrangement is to be a palindrome, then half of the A’s, half of the B’s, and 
half of the C’s must occupy the first six positions in the permutation. Moreover, the 
final six members of the sequence must be in the reverse order of the first six. For 
example, if the objects comprising the first half of the permutation were 


    C A B A B B

then the last six would need to be in the order

    B B A B A C


It follows that the number of palindromes is the number of ways to permute the 
first six objects in the sequence, because once the first six are positioned, there is only 
one arrangement of the last six that will complete the palindrome. By Theorem 2.6.2, 
then, 


number of palindromes = 6!/(2!3!1!) = 60
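A direct enumeration offers a check on the palindrome count. This sketch, whose variable names are illustrative, builds every distinct arrangement of four A's, six B's, and two C's by choosing positions for the A's and then for the C's, and tests each row against its own reversal.

```python
from itertools import combinations

positions = range(12)
total = 0
palindromes = 0
for a_pos in combinations(positions, 4):          # positions of the four A's
    rest = [p for p in positions if p not in a_pos]
    for c_pos in combinations(rest, 2):           # positions of the two C's
        row = ["B"] * 12                          # the six B's fill what is left
        for p in a_pos:
            row[p] = "A"
        for p in c_pos:
            row[p] = "C"
        total += 1
        if row == row[::-1]:                      # reads the same both ways
            palindromes += 1

print(total, palindromes)
```

The first number is the full count 12!/(4!6!2!) from Theorem 2.6.2; the second is the number of those arrangements that are palindromes.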


Example 2.6.18

A deliveryman is currently at Point X and needs to stop at Point O before driving
through to Point Y (see Figure 2.6.14). How many different routes can he take
without ever going out of his way?

Notice that any admissible path from, say, X to O is an ordered sequence of eleven
“moves”: nine east and two north. Pictured in Figure 2.6.14, for example, is the
particular X to O route

    E E N E E E E N E E E

Similarly, any acceptable path from O to Y will necessarily consist of five moves east
and three moves north (the one indicated is E E N N E N E E).

Figure 2.6.14

Since each path from X to O corresponds to a unique permutation of nine E’s
and two N’s, the number of such paths (from Theorem 2.6.2) is the quotient

    11!/(9!2!) = 55

For the same reasons, the number of different paths from O to Y is

    8!/(5!3!) = 56

By the multiplication rule, then, the total number of admissible routes from X to Y
that pass through O is the product of 55 and 56, or 3080.
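The route counts can also be obtained without Theorem 2.6.2, by counting paths corner by corner. The following sketch assumes a rectangular street grid on which only east and north moves are allowed; `count_paths` is an illustrative name.

```python
from math import comb


def count_paths(east, north):
    """Count routes made of `east` east moves and `north` north moves.

    Dynamic programming: the number of ways to reach a corner is the sum of
    the ways to reach the corner just west of it and the corner just south.
    """
    ways = [[1] * (east + 1) for _ in range(north + 1)]
    for r in range(1, north + 1):
        for c in range(1, east + 1):
            ways[r][c] = ways[r - 1][c] + ways[r][c - 1]
    return ways[north][east]


first_leg = count_paths(9, 2)   # X to the intermediate stop
second_leg = count_paths(5, 3)  # intermediate stop to Y
print(first_leg, comb(11, 2))
print(second_leg, comb(8, 3))
print(first_leg * second_leg)
```

Each leg agrees with the corresponding quotient of factorials, and the product is the total given by the multiplication rule.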


Example 2.6.19

A burglar is trying to deactivate an alarm system that has a six-digit entry code. He
notices that three of the keyboard buttons—the 3, the 4, and the 9—are more pol- 
ished than the other seven, suggesting that only those three numbers appear in the 
correct entry code. Trial and error may be a feasible strategy, but earlier misadven- 
tures have convinced him that if his probability of guessing the correct code in the 
first thirty minutes is not at least 70%, the risk of getting caught is too great. Given 
that he can try a different permutation every five seconds, what should he do? He 
could look for an unlocked window to crawl through (or, here’s a thought, get an 
honest job!). Deactivating the alarm, though, is not a good option. 

Table 2.6.3 shows that 540 six-digit permutations can be made from the numbers
3, 4, and 9 when each of the three digits appears at least once.


Table 2.6.3

Form of Permutations                       Example   Number

One digit appears four times; other
digits appear once                         449434    6!/(4!1!1!) × 3 = 90

One digit appears three times; another
appears twice; and a third appears once    944334    6!/(3!2!1!) × 3! = 360

Each digit appears twice                   439934    6!/(2!2!2!) × 1 = 90

                                                     TOTAL: 540

Guessing at the rate of one permutation every five seconds would allow 360
permutations to be tested in thirty minutes, but 360 is only 67% of 540, so the
burglar’s 70% probability criterion of success would not be met. (Question: The first
factors in Column 3 of Table 2.6.3 are applications of Theorem 2.6.2 to the sample
permutations shown in Column 2. What do the second factors in Column 3
represent?)
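The entries of Table 2.6.3 can be compared against a direct enumeration. The sketch below assumes, as the example does, that the code has six digits and that each of 3, 4, and 9 appears at least once; the variable names are illustrative. It lists all such codes, groups them by how often each digit occurs, and computes the fraction the burglar could test in thirty minutes.

```python
from collections import Counter
from itertools import product

# All six-digit codes over {3, 4, 9} in which every one of the three digits appears.
codes = [c for c in product("349", repeat=6) if set(c) == {"3", "4", "9"}]

# Group codes by their pattern of digit frequencies, e.g. (1, 1, 4) or (2, 2, 2).
by_pattern = Counter(tuple(sorted(Counter(c).values())) for c in codes)

tries_in_30_min = 30 * 60 // 5  # one attempt every five seconds
print(dict(by_pattern))
print(len(codes), tries_in_30_min, round(tries_in_30_min / len(codes), 3))
```

The three pattern counts correspond to the three rows of the table, and their sum is the total number of candidate codes.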


Questions 


2.6.34. Which state name can generate more permuta- 
tions, TENNESSEE or FLORIDA? 


2.6.35. How many numbers greater than four million can 
be formed from the digits 2, 3, 4, 4, 5, 5, 5?


2.6.36. An interior decorator is trying to arrange a shelf 
containing eight books, three with red covers, three with 
blue covers, and two with brown covers. 


(a) Assuming the titles and the sizes of the books are 
irrelevant, in how many ways can she arrange the 
eight books? 

(b) In how many ways could the books be arranged if 
they were all considered distinct? 

(c) In how many ways could the books be arranged if the 
red books were considered indistinguishable, but the 
other five were considered distinct? 


2.6.37. Four Nigerians (A, B, C, D), three Chinese (#, *,
&), and three Greeks (α, β, γ) are lined up at the box
office, waiting to buy tickets for the World’s Fair. 


(a) How many ways can they position themselves if the 
Nigerians are to hold the first four places in line; 
the Chinese, the next three; and the Greeks, the last 
three? 

(b) How many arrangements are possible if members of 
the same nationality must stay together? 

(c) How many different queues can be formed? 

(d) Suppose a vacationing Martian strolls by and wants 
to photograph the ten for her scrapbook. A bit 
myopic, the Martian is quite capable of discerning 
the more obvious differences in human anatomy 
but is unable to distinguish one Nigerian (N) from
another, one Chinese (C) from another, or one
Greek (G) from another. Instead of perceiving a
line to be B*βAD#&Cαγ, for example, she would
see NCGNNCCNGG. From the Martian’s perspective,
in how many different ways can the ten
funny-looking Earthlings line themselves up?


2.6.38. How many ways can the letters in the word 
SLUMGULLION 


be arranged so that the three L’s precede all the other 
consonants? 


2.6.39. A tennis tournament has a field of 2n entrants, all 
of whom need to be scheduled to play in the first round. 
How many different pairings are possible? 


2.6.40. What is the coefficient of $x^{12}$ in the expansion of
$(1 + x^3 + x^6)^{18}$?

2.6.41. In how many ways can the letters of the word 
ELEEMOSYNARY 

be arranged so that the S is always immediately followed
by a Y?


2.6.42. In how many ways can the word ABRA- 
CADABRA be formed in the array pictured below? 
Assume that the word must begin with the top A and 
progress diagonally downward to the bottom A. 


          A
         B B
        R R R
       A A A A
      C C C C C
     A A A A A A
      D D D D D
       A A A A
        B B B
         R R
          A


2.6.43. Suppose a pitcher faces a batter who never swings. 
For how many different ball/strike sequences will the 
batter be called out on the fifth pitch? 


2.6.44. What is the coefficient of $w^2 x^3 y z^3$ in the expansion
of $(w + x + y + z)^9$?


2.6.45. Imagine six points in a plane, no three of which 
lie on a straight line. In how many ways can the six points 
be used as vertices to form two triangles? (Hint: Number 
the points 1 through 6. Call one of the triangles A and the 
other B. What does the permutation

    A A B B A B
    1 2 3 4 5 6

represent?)


2.6.46. Show that (k!)! is divisible by $(k!)^{(k-1)!}$. (Hint: Think
of a related permutation problem whose solution would
require Theorem 2.6.2.)


2.6.47. In how many ways can the letters of the word

    BROBDINGNAGIAN

be arranged without changing the order of the vowels?


2.6.48. Make an anagram out of the familiar expression
STATISTICS IS FUN. In how many ways can the letters
in the anagram be permuted?


2.6.49. Linda is taking a five-course load her first
semester: English, math, French, psychology, and history.
In how many different ways can she earn three A’s and
two B’s? Enumerate the entire set of possibilities. Use
Theorem 2.6.2 to verify your answer.


Counting Combinations 


Order is not always a meaningful characteristic of a collection of elements. Consider 
a poker player being dealt a five-card hand. Whether he receives a 2 of hearts, 4 of 
clubs, 9 of clubs, jack of hearts, and ace of diamonds in that order, or in any one of the 
other 5! — 1 permutations of those particular five cards is irrelevant —the hand is still 
the same. As the last set of examples in this section bears out, there are many such 
situations— problems where our only legitimate concern is with the composition of 
a set of elements, not with any particular arrangement of them. 

We call a collection of k unordered elements a combination of size k. For exam- 
ple, given a set of n = 4 distinct elements— A, B, C, and D—there are six ways to 
form combinations of size 2: 


    A and B    B and C
    A and C    B and D
    A and D    C and D


A general formula for counting combinations can be derived quite easily from what 
we already know about counting permutations. 


Theorem 2.6.3

The number of ways to form combinations of size k from a set of n distinct objects,
repetitions not allowed, is denoted by the symbols $\binom{n}{k}$ or ${}_nC_k$, where

\[ \binom{n}{k} = {}_nC_k = \frac{n!}{k!\,(n-k)!} \]

Proof Let the symbol $\binom{n}{k}$ denote the number of combinations satisfying the
conditions of the theorem. Since each of those combinations can be ordered in k! ways, the
product $k!\binom{n}{k}$ must equal the number of permutations of length k that can be formed
from n distinct elements. But n distinct elements can be formed into permutations
of length k in $n(n-1)\cdots(n-k+1) = n!/(n-k)!$ ways. Therefore,

\[ k!\binom{n}{k} = \frac{n!}{(n-k)!} \]

Solving for $\binom{n}{k}$ gives the result.
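A small enumeration illustrates the relationship between the listing of combinations and the factorial formula. This is a minimal sketch using the four-element set A, B, C, D discussed earlier in the section; the variable names are illustrative.

```python
from itertools import combinations
from math import comb, factorial

elements = ["A", "B", "C", "D"]
pairs = list(combinations(elements, 2))  # unordered samples, no repetition
print(pairs)

n, k = len(elements), 2
by_formula = factorial(n) // (factorial(k) * factorial(n - k))
print(len(pairs), by_formula, comb(n, k))
```

`math.comb` evaluates the same binomial coefficient directly, so all three numbers on the last line should coincide.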


Comment It often helps to think of combinations in the context of drawing objects
out of an urn. If an urn contains n chips labeled 1 through n, the number of ways
we can reach in and draw out different samples of size k is $\binom{n}{k}$. In deference to
this sampling interpretation for the formation of combinations, $\binom{n}{k}$ is usually read
“n things taken k at a time” or “n choose k.”


Comment The symbol $\binom{n}{k}$ appears in the statement of a familiar theorem from
algebra,

\[ (x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k} \]

Since the expression being raised to a power involves two terms, x and y, the
constants $\binom{n}{k}$, k = 0, 1, ..., n, are commonly referred to as binomial coefficients.


Example 2.6.20

Eight politicians meet at a fund-raising dinner. How many greetings can be
exchanged if each politician shakes hands with every other politician exactly once?

Imagine the politicians to be eight chips (1 through 8) in an urn. A handshake
corresponds to an unordered sample of size 2 chosen from that urn. Since repetitions
are not allowed (even the most obsequious and overzealous of campaigners
would not shake hands with himself!), Theorem 2.6.3 applies, and the total number
of handshakes is

\[ \binom{8}{2} = \frac{8!}{2!\,6!} \]

or 28.


Example 2.6.21

A chemist is trying to synthesize a part of a straight-chain aliphatic hydrocarbon
polymer that consists of twenty-one radicals: ten ethyls (E), six methyls (M), and
five propyls (P). Assuming all arrangements of radicals are physically possible, how
many different polymers can be formed if no two of the methyl radicals are to be
adjacent?

Imagine arranging the E’s and the P’s without the M’s. Figure 2.6.15 shows one
such possibility. Consider the sixteen “spaces” between and outside the E’s and P’s
as indicated by the arrows in Figure 2.6.15. In order for the M’s to be nonadjacent,
they must occupy any six of these locations. But those six spaces can be chosen in
$\binom{16}{6}$ ways. And for each of the $\binom{16}{6}$ positionings of the M’s, the E’s and P’s can be
permuted in $\frac{15!}{10!\,5!}$ ways (Theorem 2.6.2).

[Figure 2.6.15: one arrangement of the ten E’s and five P’s, with arrows marking the
sixteen spaces between and outside them]

So, by the multiplication rule, the total number of polymers having nonadjacent
methyl radicals is 24,048,024:

\[ \binom{16}{6} \cdot \frac{15!}{10!\,5!} = \frac{16!}{10!\,6!} \cdot \frac{15!}{10!\,5!} = (8008)(3003) = 24{,}048{,}024 \]
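The "choose the gaps" argument in the polymer example can be tested on a chain small enough to enumerate. In the sketch below, `gap_count` and `brute_force` are illustrative names: the first applies the gap method, the second lists every distinct chain and discards those with two adjacent M's.

```python
from itertools import permutations
from math import comb


def gap_count(n_e, n_p, n_m):
    """Gap method: order the E's and P's, then pick n_m of the
    n_e + n_p + 1 spaces between and outside them for the M's."""
    return comb(n_e + n_p + 1, n_m) * comb(n_e + n_p, n_p)


def brute_force(n_e, n_p, n_m):
    """Enumerate distinct chains and keep those with no two M's adjacent."""
    chains = set(permutations("E" * n_e + "P" * n_p + "M" * n_m))
    return sum(1 for chain in chains if "MM" not in "".join(chain))


print(gap_count(3, 2, 2), brute_force(3, 2, 2))  # small chain, checked both ways
print(gap_count(10, 5, 6))                       # ten E's, five P's, six M's
```

Agreement on the small chain gives some confidence in the method; the second line applies it to the twenty-one-radical polymer, which is far too large to enumerate.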


Example 2.6.22

Binomial coefficients have many interesting properties. Perhaps the most familiar is
Pascal’s triangle,¹ a numerical array where each entry is equal to the sum of the two
numbers appearing diagonally above it (see Figure 2.6.16). Notice that each entry in
Pascal’s triangle can be expressed as a binomial coefficient, and the relationship just
described appears to reduce to a simple equation involving those coefficients:

\[ \binom{n+1}{k} = \binom{n}{k} + \binom{n}{k-1} \]    (2.6.1)

Prove that Equation 2.6.1 holds for all positive integers n and k.

    Row 0:        1
    Row 1:       1 1
    Row 2:      1 2 1
    Row 3:     1 3 3 1
    Row 4:    1 4 6 4 1

    (entry k of row n equals $\binom{n}{k}$, k = 0, 1, ..., n)

Figure 2.6.16

Consider a set of n + 1 distinct objects $A_1, A_2, \ldots, A_{n+1}$. We can obviously draw
samples of size k from that set in $\binom{n+1}{k}$ different ways. Now, consider any particular
object, for example, $A_1$. Relative to $A_1$, each of those $\binom{n+1}{k}$ samples belongs to one
of two categories: those containing $A_1$ and those not containing $A_1$. To form samples
containing $A_1$, we need to select k − 1 additional objects from the remaining n.
This can be done in $\binom{n}{k-1}$ ways. Similarly, there are $\binom{n}{k}$ ways to form samples not
containing $A_1$. Therefore, $\binom{n+1}{k}$ must equal $\binom{n}{k} + \binom{n}{k-1}$.
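The additive rule behind Pascal's triangle and the binomial-coefficient identity it suggests can be compared numerically. The following sketch builds the first few rows from the "sum of the two entries above" rule and then spot-checks the identity over a range of n and k; a finite check of this kind illustrates the identity but does not replace the proof. Variable names are illustrative.

```python
from math import comb

# Build rows 0 through 4 using only the additive rule.
rows = [[1]]
for _ in range(4):
    prev = rows[-1]
    inner = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
    rows.append([1] + inner + [1])

# Print each row next to the corresponding binomial coefficients.
for n, row in enumerate(rows):
    print(n, row, [comb(n, k) for k in range(n + 1)])

# Spot-check C(n+1, k) = C(n, k) + C(n, k-1) for 1 <= k <= n < 30.
identity_holds = all(
    comb(n + 1, k) == comb(n, k) + comb(n, k - 1)
    for n in range(1, 30)
    for k in range(1, n + 1)
)
print(identity_holds)
```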


Example 2.6.23

The answers to combinatorial questions can sometimes be obtained using quite
different approaches. What invariably distinguishes one solution from another is the
way in which outcomes are characterized.

For example, suppose you have just ordered a roast beef sub at a sandwich
shop, and now you need to decide which, if any, of the available toppings (lettuce,
tomato, onions, etc.) to add. If the shop has eight “extras” to choose from, how many
different subs can you order?

One way to answer this question is to think of each sub as an ordered sequence
of length eight, where each position in the sequence corresponds to one of the
toppings. At each of those positions, you have two choices: “add” or “do not add” that
particular topping. Pictured in Figure 2.6.17 is the sequence corresponding to the sub
that has lettuce, tomato, and onion but no other toppings. Since two choices (“add”
or “do not add”) are available for each of the eight toppings, the multiplication rule
tells us that the number of different roast beef subs that could be requested is $2^8$,
or 256.

    Topping:  Lettuce  Tomato  Onion  Mustard  Relish  Mayo  Pickles  Peppers
    Add?      Y        Y       Y      N        N       N     N        N

Figure 2.6.17

An ordered sequence of length eight, though, is not the only model capable of
characterizing a roast beef sandwich. We can also distinguish one roast beef sub from
another by the particular combination of toppings that each one has. For example,
there are $\binom{8}{4} = 70$ different subs having exactly four toppings. It follows that the
total number of different sandwiches is the total number of different combinations
of size k, where k ranges from 0 to 8. Reassuringly, that sum agrees with the ordered
sequence answer:

\[ \text{total number of different roast beef subs} = \binom{8}{0} + \binom{8}{1} + \binom{8}{2} + \cdots + \binom{8}{8} = 1 + 8 + 28 + \cdots + 1 = 256 \]

What we have just illustrated here is another property of binomial coefficients,
namely, that

\[ \sum_{k=0}^{n} \binom{n}{k} = 2^n \]    (2.6.2)

The proof of Equation 2.6.2 is a direct consequence of Newton’s binomial expansion
(see the second comment following Theorem 2.6.3).


¹ Despite its name, Pascal’s triangle was not discovered by Pascal. Its basic structure had been known hundreds
of years before the French mathematician was born. It was Pascal, though, who first made extensive use of its
properties.
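The two ways of characterizing a sub, as a yes/no sequence or as a combination of toppings, can be counted side by side. This is a minimal sketch with illustrative variable names.

```python
from itertools import product
from math import comb

toppings = 8

# Model 1: one "add" (Y) or "do not add" (N) decision per topping.
as_sequences = sum(1 for _ in product("YN", repeat=toppings))

# Model 2: choose which k toppings to include, for k = 0, 1, ..., 8.
as_combinations = sum(comb(toppings, k) for k in range(toppings + 1))

print(as_sequences, as_combinations, comb(toppings, 4))
```

The first two numbers are the two counts of all possible subs; the third is the number of subs with exactly four toppings.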


Questions 


2.6.50. How many straight lines can be drawn between 
five points (A, B, C, D, and E), no three of which are
collinear? 


2.6.51. The Alpha Beta Zeta sorority is trying to fill 
a pledge class of nine new members during fall rush. 
Among the twenty-five available candidates, fifteen have 
been judged marginally acceptable and ten highly desir- 
able. How many ways can the pledge class be chosen to 
give a two-to-one ratio of highly desirable to marginally 
acceptable candidates? 


2.6.52. A boat has a crew of eight: Two of those eight can 
row only on the stroke side, while three can row only on 
the bow side. In how many ways can the two sides of the 
boat be manned? 


2.6.53. Nine students, five men and four women, inter- 
view for four summer internships sponsored by a city 
newspaper. 


(a) In how many ways can the newspaper choose a set of 
four interns? 

(b) In how many ways can the newspaper choose a set 
of four interns if it must include two men and two 
women in each set? 

(c) How many sets of four can be picked such that not 
everyone in a set is of the same sex? 


2.6.54. The final exam in History 101 consists of five essay 
questions that the professor chooses from a pool of seven 
that are given to the students a week in advance. For how 
many possible sets of questions does a student need to be 
prepared? In this situation, does order matter? 


2.6.55. Ten basketball players meet in the school gym for 
a pickup game. How many ways can they form two teams 
of five each? 


2.6.56. Your statistics teacher announces a twenty-page
reading assignment on Monday that is to be finished by
Thursday morning. You intend to read the first $x_1$ pages
Monday, the next $x_2$ pages Tuesday, and the final $x_3$ pages
Wednesday, where $x_1 + x_2 + x_3 = 20$, and each $x_i \ge 1$. In
how many ways can you complete the assignment? That
is, how many different sets of values can be chosen for $x_1$,
$x_2$, and $x_3$?


2.6.57. In how many ways can the letters in

    MISSISSIPPI

be arranged so that no two I’s are adjacent?


2.6.58. Prove that $\sum_{k=0}^{n} \binom{n}{k} = 2^n$. (Hint: Use the binomial
expansion mentioned on p. 87.)


2.6.59. Prove that

\[ \binom{n}{0}^2 + \binom{n}{1}^2 + \cdots + \binom{n}{n}^2 = \binom{2n}{n} \]

(Hint: Rewrite the left-hand side as

\[ \binom{n}{0}\binom{n}{n} + \binom{n}{1}\binom{n}{n-1} + \binom{n}{2}\binom{n}{n-2} + \cdots + \binom{n}{n}\binom{n}{0} \]

and consider the problem of selecting a sample of n
objects from an original set of 2n objects.)


2.6.60. Show that

\[ \binom{n}{1} + \binom{n}{3} + \cdots = \binom{n}{0} + \binom{n}{2} + \cdots \]

(Hint: Consider the expansion of $(x - y)^n$.)


2.6.61. Prove that successive terms in the sequence $\binom{n}{0}$,
$\binom{n}{1}$, ..., $\binom{n}{n}$ first increase and then decrease. [Hint: Examine
the ratio of two successive terms, $\binom{n}{j+1} / \binom{n}{j}$.]


2.6.62. Mitch is trying to add a little zing to his cabaret act
by telling four jokes at the beginning of each show. His
current engagement is booked to run four months. If he gives
one performance a night and never wants to repeat the
same set of jokes on any two nights, what is the minimum
number of jokes he needs in his repertoire?


2.6.63. Compare the coefficients of $t^k$ in $(1+t)^d (1+t)^e = (1+t)^{d+e}$
to prove that

\[ \sum_{j=0}^{k} \binom{d}{j}\binom{e}{k-j} = \binom{d+e}{k} \]

2.7 Combinatorial Probability 


In Section 2.6 our concern focused on counting the number of ways a given oper- 
ation, or sequence of operations, could be performed. In Section 2.7 we want to 
couple those enumeration results with the notion of probability. Putting the two 
together makes a lot of sense—there are many combinatorial problems where an 
enumeration, by itself, is not particularly relevant. A poker player, for example, is 
not interested in knowing the total number of ways he can draw to a straight; he is 
interested, though, in his probability of drawing to a straight. 

In a combinatorial setting, making the transition from an enumeration to a 
probability is easy. If there are n ways to perform a certain operation and a total 
of m of those satisfy some stated condition—call it A—then P(A) is defined to 
be the ratio m/n. This assumes, of course, that all possible outcomes are equally 
likely. 

Historically, the “m over n” idea is what motivated the early work of Pascal, 
Fermat, and Huygens (recall Section 1.3). Today we recognize that not all probabili- 
ties are so easily characterized. Nevertheless, the m/n model—the so-called classical 
definition of probability—is entirely appropriate for describing a wide variety of 


phenomena.


Example 2.7.1

An urn contains eight chips, numbered 1 through 8. A sample of three is drawn
without replacement. What is the probability that the largest chip in the sample
is a 5?

Let A be the event “Largest chip in sample is a 5.” Figure 2.7.1 shows what
must happen in order for A to occur: (1) the 5 chip must be selected, and (2) two
chips must be drawn from the subpopulation of chips numbered 1 through 4. By the
multiplication rule, the number of samples satisfying event A is the product
$\binom{1}{1} \cdot \binom{4}{2}$.

Figure 2.7.1

The sample space S for the experiment of drawing three chips from the urn
contains $\binom{8}{3}$ outcomes, all equally likely. In this situation, then,
$m = \binom{1}{1} \cdot \binom{4}{2}$, $n = \binom{8}{3}$, and

\[ P(A) = \frac{\binom{1}{1}\binom{4}{2}}{\binom{8}{3}} = \frac{6}{56} = 0.11 \]
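Because the sample space here is small, the m/n calculation can be confirmed by listing every sample. The sketch below, with illustrative variable names, enumerates all unordered samples of three chips and counts those whose largest chip is the 5.

```python
from fractions import Fraction
from itertools import combinations

samples = list(combinations(range(1, 9), 3))      # all unordered samples of size 3
favorable = [s for s in samples if max(s) == 5]   # largest chip drawn is the 5

p = Fraction(len(favorable), len(samples))
print(len(favorable), len(samples), p, round(float(p), 2))
```

Using `Fraction` keeps the ratio exact, so the rounding to two decimals happens only at the final step.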
Example 2.7.2

An urn contains n red chips numbered 1 through n, n white chips numbered 1
through n, and n blue chips numbered 1 through n (see Figure 2.7.2). Two chips
are drawn at random and without replacement. What is the probability that the two
drawn are either the same color or the same number?

[Figure 2.7.2: an urn of 3n chips; draw two without replacement]

Let A be the event that the two chips drawn are the same color; let B be the
event that they have the same number. We are looking for $P(A \cup B)$.
Since A and B here are mutually exclusive,

\[ P(A \cup B) = P(A) + P(B) \]

With 3n chips in the urn, the total number of ways to draw an unordered sample of
size 2 is $\binom{3n}{2}$. Moreover,

\[ P(A) = P(\text{2 reds} \cup \text{2 whites} \cup \text{2 blues}) = P(\text{2 reds}) + P(\text{2 whites}) + P(\text{2 blues}) = 3\binom{n}{2} \Big/ \binom{3n}{2} \]

and

\[ P(B) = P(\text{two 1's} \cup \text{two 2's} \cup \cdots \cup \text{two } n\text{'s}) = n\binom{3}{2} \Big/ \binom{3n}{2} \]

Therefore,

\[ P(A \cup B) = \frac{3\binom{n}{2} + n\binom{3}{2}}{\binom{3n}{2}} = \frac{n+1}{3n-1} \]
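Adding P(A) and P(B) gives $[3\binom{n}{2} + n\binom{3}{2}] / \binom{3n}{2}$, which simplifies to $(n+1)/(3n-1)$. That closed form can be compared with a direct enumeration for small n. The sketch below is illustrative: chips are modeled as (color, number) pairs and the function name is an assumption.

```python
from fractions import Fraction
from itertools import combinations


def p_same_color_or_number(n):
    """Exact probability, by enumeration, that two chips drawn without
    replacement share a color or share a number."""
    chips = [(color, number) for color in "RWB" for number in range(1, n + 1)]
    pairs = list(combinations(chips, 2))
    hits = sum(1 for a, b in pairs if a[0] == b[0] or a[1] == b[1])
    return Fraction(hits, len(pairs))


for n in range(1, 7):
    print(n, p_same_color_or_number(n), Fraction(n + 1, 3 * n - 1))
```

For each n the two fractions printed on a line should be equal. Note that two distinct chips can never match in both color and number, which is why A and B are mutually exclusive.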


Example 2.7.3

Twelve fair dice are rolled. What is the probability that

a. the first six dice all show one face and the last six dice all show a second face?
b. not all the faces are the same?
c. each face appears exactly twice?


a. The sample space that corresponds to the “experiment” of rolling twelve dice
is the set of ordered sequences of length twelve, where the outcome at every
position in the sequence is one of the integers 1 through 6. If the dice are fair,
all $6^{12}$ such sequences are equally likely.

Let A be the set of rolls where the first six dice show one face and the second
six show another face. Figure 2.7.3 shows one of the sequences in the event A.
Clearly, the face that appears for the first half of the sequence could be any of
the six integers from 1 through 6.

    Faces:                 2  2  2  2  2  2  4  4  4  4   4   4
    Position in sequence:  1  2  3  4  5  6  7  8  9  10  11  12

Figure 2.7.3

Five choices would be available for the last half of the sequence (since the two
faces cannot be the same). The number of sequences in the event A, then, is
${}_6P_2 = 6 \cdot 5 = 30$. Applying the “m/n” rule gives

\[ P(A) = 30/6^{12} = 1.4 \times 10^{-8} \]

b. Let B be the event that not all the faces are the same. Then

\[ P(B) = 1 - P(B^C) = 1 - 6/6^{12} \]

since there are six sequences, (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), ...,
(6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6), where the twelve faces are all the same.

c. Let C be the event that each face appears exactly twice. From Theorem 2.6.2, the
number of ways each face can appear exactly twice is 12!/(2!·2!·2!·2!·2!·2!).
Therefore,

\[ P(C) = \frac{12!/(2! \cdot 2! \cdot 2! \cdot 2! \cdot 2! \cdot 2!)}{6^{12}} = 0.0034 \]
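The three dice probabilities are quick to evaluate numerically. The sketch below is illustrative only and simply carries out the m/n arithmetic for parts a, b, and c, with the sample space taken to be the $6^{12}$ equally likely ordered sequences.

```python
from math import factorial

total = 6 ** 12  # number of equally likely ordered outcomes

p_a = (6 * 5) / total                                # two distinct faces, in order
p_b = 1 - 6 / total                                  # complement of "all faces equal"
p_c = (factorial(12) // factorial(2) ** 6) / total   # each face exactly twice

print(f"{p_a:.1e}")
print(f"{p_b:.10f}")
print(f"{p_c:.4f}")
```

Part b shows how close to certain it is that twelve dice will not all agree: the complement has only six outcomes out of more than two billion.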


Example 2.7.4

A fair die is tossed n times. What is the probability that the sum of the faces showing
is n + 2?

The sample space associated with rolling a die n times has $6^n$ outcomes, all of
which in this case are equally likely because the die is presumed fair. There are two
“types” of outcomes that will produce a sum of n + 2: (a) n − 1 1’s and one 3 and (b)
n − 2 1’s and two 2’s (see Figure 2.7.4). By Theorem 2.6.2 the number of sequences
having n − 1 1’s and one 3 is $\frac{n!}{1!\,(n-1)!} = n$; likewise, there are
$\frac{n!}{2!\,(n-2)!} = \binom{n}{2}$ outcomes having n − 2 1’s and two 2’s. Therefore,

\[ P(\text{sum} = n+2) = \frac{n + \binom{n}{2}}{6^n} \]

[Figure 2.7.4: the two outcome types with sum n + 2: (a) 1, 1, ..., 1, 3 and
(b) 1, 1, ..., 1, 2, 2, shown against positions 1 through n]

Figure 2.7.4
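For small n the claim that only two types of outcome give a sum of n + 2 can be checked by enumerating every sequence of rolls. The sketch below uses illustrative function names and compares the brute-force probability with $[n + \binom{n}{2}]/6^n$.

```python
from fractions import Fraction
from itertools import product
from math import comb


def p_sum_brute(n):
    """Enumerate all 6**n roll sequences and count those summing to n + 2."""
    hits = sum(1 for roll in product(range(1, 7), repeat=n) if sum(roll) == n + 2)
    return Fraction(hits, 6 ** n)


def p_sum_formula(n):
    """n sequences with one 3, plus C(n, 2) sequences with two 2's."""
    return Fraction(n + comb(n, 2), 6 ** n)


for n in range(1, 7):
    print(n, p_sum_brute(n), p_sum_formula(n))
```

The enumeration grows as $6^n$, so it is feasible only for a handful of dice; the formula has no such limit.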


Example 2.7.5

Two monkeys, Mickey and Marian, are strolling along a moonlit beach when Mickey
sees an abandoned Scrabble set. Investigating, he notices that some of the letters are
missing, and what remain are the following fifty-nine:

    Letter:  A  B  C  D  E  F  G  H  I  J  K  L  M
    Count:   4  1  2  2  7  1  1  3  5  0  3  5  1

    Letter:  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
    Count:   3  2  0  0  2  8  4  2  0  1  0  2  0

Mickey, being of a romantic bent, would like to impress Marian, so he rearranges
the letters in hopes of spelling out something clever. (Note: The rearranging is
random because Mickey can’t spell; fortunately, Marian can’t read, so it really doesn’t
matter.) What is the probability that Mickey gets lucky and spells out

    She walks in beauty, like the night
    Of cloudless climes and starry skies

As we might imagine, Mickey would have to get very lucky. The total number of
ways to permute fifty-nine letters (four A’s, one B, two C’s, and so on) is a direct
application of Theorem 2.6.2:

\[ \frac{59!}{4!\,1!\,2! \cdots 2!\,0!} \]

But of that number of ways, only one is the couplet he is hoping for. So, since
he is arranging the letters randomly, making all permutations equally likely, the
probability of his spelling out Byron’s lines is

\[ \frac{1}{\dfrac{59!}{4!\,1!\,2! \cdots 2!\,0!}} \]

or, using Stirling’s formula, about $1.7 \times 10^{-61}$. Love may conquer all, but it won’t
beat those odds: Mickey would be well advised to start working on Plan B.
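The size of Mickey's long shot can be computed exactly rather than through Stirling's formula. The sketch below assumes that the fifty-nine remaining tiles are precisely the letters of the couplet, so the tile counts are tallied from the two lines themselves; variable names are illustrative.

```python
from collections import Counter
from math import factorial

couplet = ("She walks in beauty, like the night "
           "Of cloudless climes and starry skies")
tiles = Counter(ch for ch in couplet.upper() if ch.isalpha())
n_tiles = sum(tiles.values())

# Multinomial coefficient: n!/(n_A! n_B! ... n_Z!), computed with exact integers.
arrangements = factorial(n_tiles)
for count in tiles.values():
    arrangements //= factorial(count)

print(n_tiles, tiles["A"], tiles["E"], tiles["S"])
print(f"{1 / arrangements:.2e}")
```

Letters that do not occur contribute a factor of 0! = 1, so omitting them from the tally does not change the result.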


Suppose that k people are selected at random from the general population. What 
are the chances that at least two of those k were born on the same day? Known 
as the birthday problem, this is a particularly intriguing example of combinatorial 
probability because its statement is so simple, its analysis is straightforward, yet its 
solution, as we will see, is strongly contrary to our intuition. 

Picture the & individuals lined up in a row to form an ordered sequence. If leap 
year is omitted, each person might have any of 365 birthdays. By the multiplication 
rule, the group as a whole generates a sample space of 365‘ birthday sequences (see 
Figure 2.7.5). 


—+» 365* different 


Possible (365) (365)... (365) 
1 k sequences 


birthdays: 
irthdays 5 


+ 


Person 


Figure 2.7.5 


Define A to be the event “At least two people have the same birthday.” If each 
person is assumed to have the same chance of being born on any given day, the 365‘ 
sequences in Figure 2.7.5 are equally likely, and 


number of sequences in A 
365* 


P(A)= 


Counting the number of sequences in the numerator here is prohibitively dif- 
ficult because of the complexity of the event A; fortunately, counting the number 
of sequences in A‘ is quite easy. Notice that each birthday sequence in the sample 
space belongs to exactly one of two categories (see Figure 2.7.6): 


1. Atleast two people have the same birthday. 
2. Allk people have different birthdays. 


It follows that 


number of sequences in A = 365* — number of sequences where all k people 
have different birthdays 


The number of ways to form birthday sequences for k people subject to the 
restriction that all k birthdays must be different is simply the number of ways to 
form permutations of length k from a set of 365 distinct objects: 


\[
{}_{365}P_k = 365(364) \cdots (365 - k + 1)
\]

[Figure 2.7.6: the sample space of all birthday sequences of length k (containing $365^k$ outcomes), split into two regions. One holds the sequences where at least two people have the same birthday, such as (July 13, Sept. 2, ..., July 13) and (April 4, April 4, ..., Aug. 17). The other holds the sequences where all k people have different birthdays, such as (June 14, Jan. 10, ..., Oct. 28) and (Aug. 10, March 1, ..., Sept. 8).]


Therefore,

\[
P(A) = P(\text{at least two people have the same birthday}) = \frac{365^k - 365(364) \cdots (365 - k + 1)}{365^k}
\]


Table 2.7.1 shows P(A) for k values of 15, 22, 23, 40, 50, and 70. Notice how the 
P(A)’s greatly exceed what our intuition would suggest. 
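The formula for P(A) is easy to evaluate numerically. The following is a minimal Python sketch, not part of the original text; the function name `birthday_match_probability` and its `days` parameter are illustrative assumptions. It uses the complement count $365(364) \cdots (365 - k + 1)$ exactly as derived above.

```python
from math import prod

def birthday_match_probability(k: int, days: int = 365) -> float:
    """P(at least two of k people share a birthday), assuming equally likely days."""
    if k > days:
        return 1.0
    # Complement: all k birthdays differ, i.e. days * (days - 1) * ... * (days - k + 1) sequences.
    all_different = prod(range(days - k + 1, days + 1))
    return 1 - all_different / days**k

# The k values used in Table 2.7.1.
for k in (15, 22, 23, 40, 50, 70):
    print(k, round(birthday_match_probability(k), 3))
```

Exact integer arithmetic is used for the numerator and denominator, so the only rounding happens in the final division.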


Comment Presidential biographies offer one opportunity to “confirm” the unex- 
pectedly large values that Table 2.7.1 gives for P(A). Among our first k = 40 
presidents, two did have the same birthday: Harding and Polk were both born on 
November 2. More surprising, though, are the death dates of the presidents: John 
Adams, Jefferson, and Monroe all died on July 4, and Fillmore and Taft both died 
on March 8. 


Table 2.7.1 


k  P(A)=P (at least two have same birthday) 


15 0.253 
22 0.476 
23 0.507 
40 0.891 
50 0.970 
70 0.999 


Comment The values for P(A) in Table 2.7.1 are actually slight underestimates for
the true probabilities that at least two of k people will be born on the same day. The
assumption made earlier that all $365^k$ birthday sequences are equally likely is not
entirely correct: Births are somewhat more common during the summer than they
are during the winter. It has been proven, though, that any sort of deviation from
the equally likely model will serve only to increase the chances that two or more
people will share the same birthday (117). So, if k = 40, for example, the probability
is slightly greater than 0.891 that at least two were born on the same day.
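The general result cited above is not proved here, but the simplest case, k = 2, can be checked directly. In the sketch below, the symbol $p_i$ (the probability of being born on day i) is notation introduced only for this illustration, and the two birthdays are assumed to be independent.

```latex
\[
P(\text{two people share a birthday}) = \sum_{i=1}^{365} p_i^2
  = \frac{1}{365} + \sum_{i=1}^{365} \left(p_i - \frac{1}{365}\right)^2
  \ge \frac{1}{365}
\]
```

Equality holds only when every $p_i$ equals 1/365, so for k = 2 any departure from the equally likely model raises the probability of a match.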


One of the more instructive (and, to some, one of the more useful) applications of
combinatorics is the calculation of probabilities associated with various poker hands.
It will be assumed in what follows that five cards are dealt from a poker deck and
that no other cards are showing, although some may already have been dealt. The
sample space is the set of $\binom{52}{5}$ = 2,598,960 different hands, each having probability
1/2,598,960. What are the chances of being dealt (a) a full house, (b) one pair, and
(c) a straight? [Probabilities for the various other kinds of poker hands (two pairs,
three-of-a-kind, flush, and so on) are obtained in much the same way.]


a. Full house. A full house consists of three cards of one denomination and two
of another. Figure 2.7.7 shows a full house consisting of three 7’s and two
queens. Denominations for the three-of-a-kind can be chosen in $\binom{13}{1}$ ways. Then,
given that a denomination has been decided on, the three requisite suits can
be selected in $\binom{4}{3}$ ways. Applying the same reasoning to the pair gives $\binom{12}{1}$
available denominations, each having $\binom{4}{2}$ possible choices of suits. Thus, by the
multiplication rule,

\[
P(\text{full house}) = \frac{\binom{13}{1}\binom{4}{3}\binom{12}{1}\binom{4}{2}}{\binom{52}{5}} = 0.00144
\]

[Figure 2.7.7: grid of suits (D, H, C, S) against denominations (2 through A), with X's marking three 7's and two queens.]


b. One pair. To qualify as a one-pair hand, the five cards must include two of
the same denomination and three “single” cards, that is, cards whose denominations
match neither the pair nor each other. Figure 2.7.8 shows a pair of 6’s. For
the pair, there are $\binom{13}{1}$ possible denominations and, once selected, $\binom{4}{2}$ possible
suits. Denominations for the three single cards can be chosen in $\binom{12}{3}$ ways
(see Question 2.7.16), and each card can have any of $\binom{4}{1}$ suits. Multiplying these
factors together and dividing by $\binom{52}{5}$ gives a probability of 0.42:

\[
P(\text{one pair}) = \frac{\binom{13}{1}\binom{4}{2}\binom{12}{3}\binom{4}{1}\binom{4}{1}\binom{4}{1}}{\binom{52}{5}} = 0.42
\]

[Figure 2.7.8: grid of suits (D, H, C, S) against denominations (2 through A), with X's marking a pair of 6's and three single cards.]


c. Straight. A straight is five cards having consecutive denominations but not all 
in the same suit—for example, a 4 of diamonds, 5 of hearts, 6 of hearts, 7 of 
clubs, and 8 of diamonds (see Figure 2.7.9). An ace may be counted “high” or 
“low,” which means that (10, jack, queen, king, ace) is a straight and so is (ace, 
2, 3, 4, 5). (If five consecutive cards are all in the same suit, the hand is called 
a straight flush. The latter is considered a fundamentally different type of hand 
in the sense that a straight flush “beats” a straight.) To get the numerator for 
P(straight), we will first ignore the condition that all five cards not be in the 
same suit and simply count the number of hands having consecutive denomina- 
tions. Note there are ten sets of consecutive denominations of length five: (ace, 
2, 3, 4,5), (2, 3, 4, 5, 6), ..., (10, jack, queen, king, ace). With no restrictions on 
the suits, each card can be either a diamond, heart, club, or spade. It follows, 
then, that the number of five-card hands having consecutive denominations is 


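$10 \cdot \binom{4}{1}^5$. But forty ($= 10 \cdot 4$) of those hands are straight flushes. Therefore,

\[
P(\text{straight}) = \frac{10 \cdot \binom{4}{1}^5 - 40}{\binom{52}{5}} = 0.0039
\]

Table 2.7.2 shows the probabilities associated with all the different poker hands.
Hand i beats hand j if P(hand i) < P(hand j).

[Figure 2.7.9: grid of suits (D, H, C, S) against denominations (2 through A), with X's marking a straight: 4 of diamonds, 5 of hearts, 6 of hearts, 7 of clubs, and 8 of diamonds.]
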
Table 2.7.2 
Hand Probability 
One pair 0.42 
Two pairs 0.048 
Three-of-a-kind 0.021 
Straight 0.0039 
Flush 0.0020 
Full house 0.0014 
Four-of-a-kind 0.00024 
Straight flush 0.000014 
Royal flush 0.0000015 
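The three counting arguments worked out above (full house, one pair, straight) can be checked with a few lines of Python. This is a sketch for verification only, not part of the original text; the variable names are illustrative, and `math.comb(n, k)` plays the role of $\binom{n}{k}$.

```python
from math import comb

HANDS = comb(52, 5)  # 2,598,960 equally likely five-card hands

# Numerator counts, following the multiplication-rule arguments in the text.
full_house = comb(13, 1) * comb(4, 3) * comb(12, 1) * comb(4, 2)
one_pair = comb(13, 1) * comb(4, 2) * comb(12, 3) * comb(4, 1) ** 3
straight = 10 * comb(4, 1) ** 5 - 40  # subtract the forty straight flushes

for name, count in [("full house", full_house),
                    ("one pair", one_pair),
                    ("straight", straight)]:
    print(f"{name}: {count} hands, probability {count / HANDS:.5f}")
```

Note that every numerator and the denominator count unordered hands, which keeps the quotient consistent with respect to order.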
Problem-Solving Hints


(Doing combinatorial probability problems) 


Listed on p. 72 are several hints that can be helpful in counting the number 
of ways to do something. Those same hints apply to the solution of combinato- 
rial probability problems, but a few others should be kept in mind as well. 


1. The solution to a combinatorial probability problem should be set up as a 
quotient of numerator and denominator enumerations. Avoid the tempta- 
tion to multiply probabilities associated with each position in the sequence. 
The latter approach will always “sound” reasonable, but it will frequently 
oversimplify the problem and give the wrong answer. 

2. Keep the numerator and denominator consistent with respect to order—if 
permutations are being counted in the numerator, be sure that permuta- 
tions are being counted in the denominator; likewise, if the outcomes in 
the numerator are combinations, the outcomes in the denominator must 
also be combinations. 

3. The number of outcomes associated with any problem involving the rolling
of n six-sided dice is $6^n$; similarly, the number of outcomes associated
with tossing a coin n times is $2^n$. The number of outcomes associated with
dealing a hand of n cards from a standard 52-card poker deck is ${}_{52}C_n$.
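Hint 2 can be illustrated with a small worked case. The example is an assumption chosen for illustration and does not appear in the text: two cards are dealt from a standard 52-card deck, and the event of interest is that both are aces.

```latex
\[
\underbrace{\frac{\binom{4}{2}}{\binom{52}{2}} = \frac{6}{1326}}_{\text{combinations over combinations}}
= \frac{1}{221} =
\underbrace{\frac{4 \cdot 3}{52 \cdot 51} = \frac{12}{2652}}_{\text{permutations over permutations}}
\qquad \text{but} \qquad
\frac{4 \cdot 3}{\binom{52}{2}} = \frac{12}{1326} = \frac{2}{221}
\]
```

Either consistent choice gives 1/221. Mixing an ordered numerator with an unordered denominator doubles the answer.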


Questions 


2.7.1. Ten equally qualified marketing assistants are can- 
didates for promotion to associate buyer; seven are men 
and three are women. If the company intends to promote 
four of the ten at random, what is the probability that 
exactly two of the four are women? 


2.7.2. An urn contains six chips, numbered 1 through 6. 
Two are chosen at random and their numbers are added 
together. What is the probability that the resulting sum is 
equal to 5? 


2.7.3. An urn contains twenty chips, numbered 1 through 
20. Two are drawn simultaneously. What is the probabil- 
ity that the numbers on the two chips will differ by more 
than 2? 


2.7.4. A bridge hand (thirteen cards) is dealt from a stan- 
dard 52-card deck. Let A be the event that the hand con- 
tains four aces; let B be the event that the hand contains 
four kings. Find P(A U B). 


2.7.5. Consider a set of ten urns, nine of which contain 
three white chips and three red chips each. The tenth con- 
tains five white chips and one red chip. An urn is picked at 
random. Then a sample of size 3 is drawn without replace- 
ment from that urn. If all three chips drawn are white, 
what is the probability that the urn being sampled is the 
one with five white chips? 


2.7.6. A committee of fifty politicians is to be chosen from 
among our one hundred U.S. senators. If the selection is 
done at random, what is the probability that each state 
will be represented? 


2.7.7. Suppose that n fair dice are rolled. What are the 
chances that all n faces will be the same? 


2.7.8. Five fair dice are rolled. What is the probability 
that the faces showing constitute a “full house” —that is, 
three faces show one number and two faces show a second 
number? 


2.7.9. Imagine that the test tube pictured contains 2n 
grains of sand, n white and n black. Suppose the tube is 
vigorously shaken. What is the probability that the two 
colors of sand will completely separate; that is, all of one 
color fall to the bottom, and all of the other color lie on 
top? (Hint: Consider the 2n grains to be aligned in a row. 
In how many ways can the n white and the n black grains 
be permuted?) 


2.7.10. Does a monkey have a better chance of rearranging


ACCLLUUS to spell CALCULUS


or


AABEGLR to spell ALGEBRA?


2.7.11. An apartment building has eight floors. If seven 
people get on the elevator on the first floor, what is the 
probability they all want to get off on different floors? On 
the same floor? What assumption are you making? Does 
it seem reasonable? Explain. 


2.7.12. If the letters in the phrase 
A ROLLING STONE GATHERS NO MOSS 


are arranged at random, what are the chances that not all 
the S’s will be adjacent? 


2.7.13. Suppose each of ten sticks is broken into a long 
part and a short part. The twenty parts are arranged into 
ten pairs and glued back together so that again there are 
ten sticks. What is the probability that each long part will 
be paired with a short part? (Note: This problem is a model 
for the effects of radiation on a living cell. Each chro- 
mosome, as a result of being struck by ionizing radiation, 
breaks into two parts, one part containing the centromere. 
The cell will die unless the fragment containing the cen- 
tromere recombines with a fragment not containing a 
centromere.) 


2.7.14. Six dice are rolled one time. What is the probabil- 
ity that each of the six faces appears? 


2.7.15. Suppose that a randomly selected group of k peo- 
ple are brought together. What is the probability that 
exactly one pair has the same birthday? 


2.7.16. For one-pair poker hands, why is the number of
denominations for the three single cards $\binom{12}{3}$ rather than
$\binom{12}{1}\binom{11}{1}\binom{10}{1}$?

2.7.17. Dana is not the world’s best poker player. Dealt a
2 of diamonds, an 8 of diamonds, an ace of hearts, an ace
of clubs, and an ace of spades, she discards the three aces.
What are her chances of drawing to a flush?


2.7.18. A poker player is dealt a 7 of diamonds, a queen 
of diamonds, a queen of hearts, a queen of clubs, and an 
ace of hearts. He discards the 7. What is his probability of 
drawing to either a full house or four-of-a-kind? 


2.7.19. Tim is dealt a 4 of clubs, a 6 of hearts, an 8 of hearts, 
a 9 of hearts, and a king of diamonds. He discards the 4 
and the king. What are his chances of drawing to a straight 
flush? To a flush? 


2.7.20. Five cards are dealt from a standard 52-card deck. 
What is the probability that the sum of the faces on the 
five cards is 48 or more? 


2.7.21. Nine cards are dealt from a 52-card deck. Write 
a formula for the probability that three of the five even 
numerical denominations are represented twice, one of 
the three face cards appears twice, and a second face card 
appears once. (Note: Face cards are the jacks, queens, 
and kings; 2, 4, 6, 8, and 10 are the even numerical 
denominations.) 


2.7.22. A coke hand in bridge is one where none of the 
thirteen cards is an ace or is higher than a 9. What is the 
probability of being dealt such a hand? 


2.7.23. A pinochle deck has forty-eight cards, two of 
each of six denominations (9, J, Q, K, 10, A) and the 
usual four suits. Among the many hands that count for 
meld is a roundhouse, which occurs when a player has a 
king and queen of each suit. In a hand of twelve cards, 
what is the probability of getting a “bare” roundhouse 
(a king and queen of each suit and no other kings or 
queens)? 


2.7.24. A somewhat inebriated conventioneer finds him- 
self in the embarrassing predicament of being unable to 
predetermine whether his next step will be forward or 
backward. What is the probability that after hazarding n 
such maneuvers he will have stumbled forward a distance 
of r steps? (Hint: Let x denote the number of steps he 
takes forward and y, the number backward. Then x + y=n 
and x — y=r.) 


2.8 Taking a Second Look at Statistics (Monte Carlo Techniques)


Recall the von Mises definition of probability given on p. 17. If an experiment is
repeated n times under identical conditions, and if the event E occurs on m of those
repetitions, then

\[
P(E) = \lim_{n \to \infty} \frac{m}{n} \tag{2.8.1}
\]
To be sure, Equation 2.8.1 is an asymptotic result, but it suggests an obvious (and
very useful) approximation: if n is finite,

\[
P(E) \approx \frac{m}{n}
\]


In general, efforts to estimate probabilities by simulating repetitions of an 
experiment (usually with a computer) are referred to as Monte Carlo studies. Usu- 
ally the technique is used in situations where an exact probability is difficult to 
calculate. It can also be used, though, as an empirical justification for choosing one 
proposed solution over another. 
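As a concrete illustration, the m/n approximation can be written as a short, generic routine. The sketch below is an assumption-laden example rather than anything from the text: the names `estimate_probability` and `shared_birthday` are hypothetical, the seed and the number of repetitions are arbitrary, and the experiment simulated is the birthday problem of Section 2.7 with k = 23.

```python
import random

def estimate_probability(experiment, n, seed=0):
    """Approximate P(E) by m/n, where m counts the repetitions in which E occurs."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    m = sum(1 for _ in range(n) if experiment(rng))
    return m / n

def shared_birthday(rng, k=23, days=365):
    """One repetition: draw k birthdays at random; report whether any two coincide."""
    birthdays = [rng.randrange(days) for _ in range(k)]
    return len(set(birthdays)) < k

estimate = estimate_probability(shared_birthday, n=20_000)
print(f"estimated P(A) for k = 23: {estimate:.3f}")  # compare with 0.507 in Table 2.7.1
```

Because the estimate is a sample proportion, it will vary from run to run (or seed to seed) by an amount that shrinks as n grows.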

For example, consider the game described in Example 2.4.12. An urn contains a
red chip, a blue chip, and a two-color chip (red on one side, blue on the other). One 
chip is drawn at random and placed on a table. The question is, if blue is showing, 
what is the probability that the color underneath is also blue? 

Pictured in Figure 2.8.1 are two ways of conceptualizing the question just posed.
The outcomes in (a) are assuming that a chip was drawn. Starting with that premise,
the answer to the question is 1/2: the red chip is obviously eliminated and only one
of the two remaining chips is blue on both sides.


(a) Chip drawn: red, blue, two-color. P(B | B) = 1/2
(b) Side drawn: red/red, blue/blue, red/blue. P(B | B) = 2/3

Figure 2.8.1

Table 2.8.1

Trial#  S  U      Trial#  S  U      Trial#  S  U      Trial#  S  U
   1    R  B        26    B  R        51    B  R        76    B  B*
   2    B  B*       27    R  R        52    R  B        77    B  B*
   3    B  R        28    R  B        53    B  B*       78    R  R
   4    R  R        29    R  B        54    R  B        79    B  B*
   5    R  B        30    R  R        55    R  R        80    R  R
   6    R  B        31    R  B        56    R  B        81    R  B
   7    R  R        32    B  B*       57    R  R        82    R  B
   8    R  R        33    R  B        58    B  B*       83    R  R
   9    B  B*       34    B  B*       59    B  R        84    B  R
  10    B  R        35    B  B*       60    B  B*       85    B  R
  11    R  R        36    R  R        61    B  R        86    R  R
  12    B  B*       37    B  R        62    R  B        87    B  B*
  13    R  R        38    B  B*       63    R  R        88    R  B
  14    B  R        39    R  R        64    R  R        89    B  R
  15    B  B*       40    B  B*       65    B  B*       90    R  R
  16    B  B*       41    B  B*       66    B  R        91    R  B
  17    R  B        42    B  R        67    R  R        92    R  R
  18    B  R        43    B  B*       68    B  B*       93    R  R
  19    B  B*       44    B  B*       69    B  B*       94    R  B
  20    B  B*       45    B  B*       70    R  R        95    B  B*
  21    R  R        46    R  R        71    R  R        96    B  B*
  22    R  R        47    B  B*       72    B  B*       97    B  R
  23    B  B*       48    B  B*       73    R  B        98    R  R
  24    B  R        49    R  R        74    R  R        99    B  B*
  25    B  B*       50    R  R        75    B  B*      100    B  B*

By way of contrast, the outcomes in (b) are assuming that the side of a chip was
drawn. If so, the blue color showing could be any of three blue sides, two of which
are blue underneath. According to model (b), then, the probability of both sides
being blue is 2/3.

The formal analysis on p. 46, of course, resolves the debate: the correct answer
is 2/3. But suppose that such a derivation was unavailable. How might we assess the
relative plausibilities of 1/2 and 2/3? The answer is simple: just play the game a number
of times and see what proportion of outcomes that show blue on top have blue
underneath.

To that end, Table 2.8.1 summarizes the results of one hundred random draw- 
ings. For a total of fifty-two, blue was showing (S) when the chip was placed on a 
table; for thirty-six of the trials (those marked with an asterisk), the color under- 
neath (U) was also blue. Using the approximation suggested by Equation 2.8.1, 


\[
P(\text{Blue is underneath} \mid \text{Blue is on top}) = P(B \mid B) \approx \frac{36}{52} = 0.69
\]

a figure much more consistent with 2/3 than with 1/2.
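The same experiment is easy to repeat many more than one hundred times by computer. The following Python sketch is an illustration under stated assumptions: the function names, the seed, and the number of trials are arbitrary choices, and the chip is assumed to land with either side up with equal probability.

```python
import random

def play_once(rng):
    """Draw one chip and place it with a random side up; return (showing, underneath)."""
    chip = rng.choice([("R", "R"), ("B", "B"), ("R", "B")])
    sides = list(chip)
    rng.shuffle(sides)
    return sides[0], sides[1]

def estimate_blue_underneath(n, seed=0):
    """Proportion of blue-showing trials in which the hidden side is also blue."""
    rng = random.Random(seed)
    blue_showing = blue_both = 0
    for _ in range(n):
        showing, underneath = play_once(rng)
        if showing == "B":
            blue_showing += 1
            blue_both += underneath == "B"
    return blue_both / blue_showing

print(round(36 / 52, 2))                  # tally from Table 2.8.1
print(estimate_blue_underneath(100_000))  # expected to settle near 2/3 rather than 1/2
```

Only the trials in which blue is showing enter the denominator, mirroring the conditioning in P(B | B).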

The point of this example is not to downgrade the importance of rigorous 
derivations and exact answers. Far from it. The application of Theorem 2.4.1 to 
solve the problem posed in Example 2.4.12 is obviously superior to the Monte Carlo 
approximation illustrated in Table 2.8.1. Still, replications of experiments can often 
provide valuable insights and call attention to nuances that might otherwise go 
unnoticed. As a problem-solving technique in probability and combinatorics, they 
are extremely important. 
Appendix 3.A.1 Minitab Applications


One of a Swiss family producing eight distinguished scientists, Jakob was forced by 
his father to pursue theological studies, but his love of mathematics eventually led 
him to a university career. He and his brother, Johann, were the most prominent 
champions of Leibniz’s calculus on continental Europe, the two using the new theory 
to solve numerous problems in physics and mathematics. Bernoulli’s main work in 
probability, Ars Conjectandi, was published after his death by his nephew, Nikolaus, 
in 1713. 

—Jakob (Jacques) Bernoulli (1654-1705) 


3.1 Introduction 


Throughout Chapter 2, probabilities were assigned to events—that is, to sets of 
sample outcomes. The events we dealt with were composed of either a finite or a 
countably infinite number of sample outcomes, in which case the event’s probabil- 
ity was simply the sum of the probabilities assigned to its outcomes. One particular 
probability function that came up over and over again in Chapter 2 was the assignment
of 1/n as the probability associated with each of the n points in a finite sample
space. This is the model that typically describes games of chance (and all of our 
combinatorial probability problems in Chapter 2). 

The first objective of this chapter is to look at several other useful ways for 
assigning probabilities to sample outcomes. In so doing, we confront the desirability 
of “redefining” sample spaces using functions known as random variables. How and
why these are used, and what their mathematical properties are, become the focus
of virtually everything covered in Chapter 3.

As a case in point, suppose a medical researcher is testing eight elderly adults
for their allergic reaction (yes or no) to a new drug for controlling blood pressure.
One of the $2^8 = 256$ possible sample points would be the sequence (yes, no, no, yes,
no, no, yes, no), signifying that the first subject had an allergic reaction, the second
did not, the third did not, and so on. Typically, in studies of this sort, which particular
subjects experience reactions is of little interest: what does matter is the number
who show a reaction. If that were true here, the outcome’s relevant information (i.e.,
the number of allergic reactions) could be summarized by the number 3.¹

Suppose X denotes the number of allergic reactions among a set of eight adults. 
Then X is said to be a random variable and the number 3 is the value of the random 
variable for the outcome (yes, no, no, yes, no, no, yes, no). 

In general, random variables are functions that associate numbers with some
attribute of a sample outcome that is deemed to be especially important. If X
denotes the random variable and s denotes a sample outcome, then X(s) = t, where
t is a real number. For the allergy example, s = (yes, no, no, yes, no, no, yes, no) and
t = 3.

Random variables can often create a dramatically simpler sample space. That
certainly is the case here: the original sample space has 256 ($= 2^8$) outcomes, each
being an ordered sequence of length eight. The random variable X, on the other
hand, has only nine possible values, the integers from 0 to 8, inclusive.
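The reduction from 256 outcomes to nine values can be seen by brute-force enumeration. The Python sketch below is illustrative only; representing outcomes as tuples of the strings "yes" and "no" is an assumption made for the example.

```python
from collections import Counter
from itertools import product

# Sample space: every ordered sequence of eight yes/no reactions.
sample_space = list(product(("yes", "no"), repeat=8))

def X(s):
    """Random variable: the number of allergic reactions (yeses) in outcome s."""
    return s.count("yes")

s = ("yes", "no", "no", "yes", "no", "no", "yes", "no")
distribution = Counter(X(outcome) for outcome in sample_space)

print(len(sample_space))     # size of the original sample space
print(X(s))                  # value of X for the outcome discussed in the text
print(sorted(distribution))  # the distinct values X can take
print(distribution[3])       # number of outcomes that X maps to 3
```

The last line counts the outcomes having exactly three yeses, which is the quantity 8!/3!5! given by Theorem 2.6.2.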

In terms of their fundamental structure, all random variables fall into one of 
two broad categories, the distinction resting on the number of possible values the 
random variable can equal. If the latter is finite or countably infinite (which would 
be the case with the allergic reaction example), the random variable is said to be 
discrete; if the outcomes can be any real number in a given interval, the number of 
possibilities is uncountably infinite, and the random variable is said to be continuous. 
The difference between the two is critically important, as we will learn in the next 
several sections. 

The purpose of Chapter 3 is to introduce the important definitions, concepts, 
and computational techniques associated with random variables, both discrete and 
continuous. Taken together, these ideas form the bedrock of modern probability and 
statistics. 


3.2 Binomial and Hypergeometric Probabilities 


This section looks at two specific probability scenarios that are especially impor- 
tant, both for their theoretical implications as well as for their ability to describe 
real-world problems. What we learn in developing these two models will help us 
understand random variables in general, the formal discussion of which begins in 
Section 3.3. 


The Binomial Probability Distribution 


Binomial probabilities apply to situations involving a series of independent and
identical trials, where each trial can have only one of two possible outcomes. Imagine
three distinguishable coins being tossed, each having a probability p of coming
up heads. The possible outcomes are the eight listed in Table 3.2.1. If the probability
of any of the coins coming up heads is p, then the probability of the sequence
(H, H, H) is $p^3$, since the coin tosses qualify as independent trials. Similarly, the
probability of (T, H, H) is $(1 - p)p^2$. The fourth column of Table 3.2.1 shows the
probabilities associated with each of the three-coin sequences.


¹ By Theorem 2.6.2, of course, there would be a total of fifty-six (= 8!/3!5!) outcomes having exactly three yeses.
All fifty-six would be equivalent in terms of what they imply about the drug’s likelihood of causing allergic
reactions.


Table 3.2.1

1st Coin   2nd Coin   3rd Coin   Probability      Number of Heads
H          H          H          $p^3$            3
H          H          T          $p^2(1 - p)$     2
H          T          H          $p^2(1 - p)$     2
T          H          H          $p^2(1 - p)$     2
H          T          T          $p(1 - p)^2$     1
T          H          T          $p(1 - p)^2$     1
T          T          H          $p(1 - p)^2$     1
T          T          T          $(1 - p)^3$      0


Suppose our main interest in the coin tosses is the number of heads that occur.
Whether the actual sequence is, say, (H, H, T) or (H, T, H) is immaterial, since
each outcome contains exactly two heads. The last column of Table 3.2.1 shows the
number of heads in each of the eight possible outcomes. Notice that there are three
outcomes with exactly two heads, each having an individual probability of $p^2(1 - p)$.
The probability, then, of the event “two heads” is the sum of those three individual
probabilities, that is, $3p^2(1 - p)$. Table 3.2.2 lists the probabilities of tossing k heads,
where k = 0, 1, 2, or 3.


Table 3.2.2

Number of Heads   Probability
0                 $(1 - p)^3$
1                 $3p(1 - p)^2$
2                 $3p^2(1 - p)$
3                 $p^3$

Now, more generally, suppose that n coins are tossed, in which case the number
of heads can equal any integer from 0 through n. By analogy,

\[
\begin{aligned}
P(k \text{ heads}) &= \left(\text{number of ways to arrange } k \text{ heads and } n - k \text{ tails}\right) \cdot \left(\text{probability of any particular sequence having } k \text{ heads and } n - k \text{ tails}\right) \\
&= \left(\text{number of ways to arrange } k \text{ heads and } n - k \text{ tails}\right) \cdot p^k (1 - p)^{n-k}
\end{aligned}
\]

The number of ways to arrange k H’s and n − k T’s, though, is ${}_nC_k$ or $\binom{n}{k}$ (recall
Theorem 2.6.2).


Theorem 3.2.1

Consider a series of n independent trials, each resulting in one of two possible outcomes,
“success” or “failure.” Let p = P(success occurs at any given trial) and assume
that p remains constant from trial to trial. Then

\[
P(k \text{ successes}) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, \ldots, n
\]
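The binomial formula can be checked against the sequence-by-sequence reasoning used for the three coins. In the Python sketch below, the function names are hypothetical and p = 0.3 is an arbitrary illustrative value; `enumerated_pmf` adds up the probability of every individual H/T sequence with k heads, and `binomial_pmf` applies the closed form.

```python
from itertools import product
from math import comb, isclose

def binomial_pmf(k, n, p):
    """P(k successes in n independent trials), using the closed-form expression."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def enumerated_pmf(k, n, p):
    """Same probability, found by summing over every H/T sequence with k heads."""
    total = 0.0
    for seq in product("HT", repeat=n):
        if seq.count("H") == k:
            total += p**k * (1 - p) ** (n - k)
    return total

n, p = 3, 0.3  # p = 0.3 is an arbitrary illustrative value
for k in range(n + 1):
    print(k, round(binomial_pmf(k, n, p), 4), round(enumerated_pmf(k, n, p), 4))
```

The enumeration is exponential in n and is meant only as a check for small n; the closed form is what one would use in practice.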


Comment The probability assignment given by the equation in Theorem 3.2.1 is
known as the binomial distribution.


Example 3.2.1

An information technology center uses nine aging disk drives for storage. The probability
that any one of them is out of service is 0.06. For the center to function
properly, at least seven of the drives must be available. What is the probability that
the computing center can get its work done?

The probability that a drive is available is p = 1 − 0.06 = 0.94. Assuming the
devices operate independently, the number of disk drives available has a binomial
distribution with n = 9 and p = 0.94. The probability that at least seven disk drives
work is a reassuring 0.986:

\[
\binom{9}{7}(0.94)^7(0.06)^2 + \binom{9}{8}(0.94)^8(0.06)^1 + \binom{9}{9}(0.94)^9(0.06)^0 = 0.986
\]
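For readers who want to see where the 0.986 comes from, the three terms of the disk-drive sum can be evaluated one at a time. The intermediate values below are rounded and are supplied here as a check; they do not appear in the original text.

```latex
\begin{align*}
\binom{9}{7}(0.94)^7(0.06)^2 &= 36\,(0.648478)(0.0036) \approx 0.0840 \\
\binom{9}{8}(0.94)^8(0.06)^1 &= 9\,(0.609569)(0.06) \approx 0.3292 \\
\binom{9}{9}(0.94)^9(0.06)^0 &= 0.572995 \approx 0.5730 \\
P(\text{at least seven drives available}) &\approx 0.0840 + 0.3292 + 0.5730 = 0.9862
\end{align*}
```

More than half of the total comes from the case in which all nine drives are available.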


Example 3.2.2

Kingwest Pharmaceuticals is experimenting with a new affordable AIDS medication,
PM-17, that may have the ability to strengthen a victim’s immune system.
Thirty monkeys infected with the HIV complex have been given the drug.
Researchers intend to wait six weeks and then count the number of animals whose 
immunological responses show a marked improvement. Any inexpensive drug capa- 
ble of being effective 60% of the time would be considered a major breakthrough; 
medications whose chances of success are 50% or less are not likely to have any 
commercial potential. 

Yet to be finalized are guidelines for interpreting results. Kingwest hopes to 
avoid making either of two errors: (1) rejecting a drug that would ultimately prove 
to be marketable and (2) spending additional development dollars on a drug whose 
effectiveness, in the long run, would be 50% or less. As a tentative “decision rule,” 
the project manager suggests that unless sixteen or more of the monkeys show 
improvement, research on PM-17 should be discontinued. 


a. What are the chances that the “sixteen or more” rule will cause the company to 
reject PM-17, even if the drug is 60% effective? 

b. How often will the “sixteen or more” rule allow a 50%-effective drug to be 
perceived as a major breakthrough? 


(a) Each of the monkeys is one of n = 30 independent trials, where the out- 
come is either a “success” (Monkey’s immune system is strengthened) or a 
“failure” (Monkey’s immune system is not strengthened). By assumption, 
the probability that PM-17 produces an immunological improvement in any 
given monkey is p= P (success) = 0.60. 

By Theorem 3.2.1, the probability that exactly k monkeys (out of thirty)
will show improvement after six weeks is $\binom{30}{k}(0.60)^k(0.40)^{30-k}$. The probability,
then, that the “sixteen or more” rule will cause a 60%-effective drug
to be discarded is the sum of “binomial” probabilities for k values ranging
from 0 to 15:

\[
P(\text{60\%-effective drug fails “sixteen or more” rule}) = \sum_{k=0}^{15} \binom{30}{k}(0.60)^k(0.40)^{30-k} = 0.1754
\]

Roughly 18% of the time, in other words, a “breakthrough” drug such as
PM-17 will produce test results so mediocre (as measured by the “sixteen
or more” rule) that the company will be misled into thinking it has no
potential.

(b) The other error Kingwest can make is to conclude that PM-17 warrants
further study when, in fact, its value for p is below a marketable level. The
chance that this particular incorrect inference will be drawn here is the probability
that the number of successes will be greater than or equal to sixteen
when p = 0.5. That is,

\[
\begin{aligned}
P(\text{50\%-effective PM-17 appears to be marketable}) &= P(\text{Sixteen or more successes occur}) \\
&= \sum_{k=16}^{30} \binom{30}{k}(0.5)^k(0.5)^{30-k} \\
&= 0.43
\end{aligned}
\]

Thus, even if PM-17’s success rate is an unacceptably low 50%, it has a 43%
chance of performing sufficiently well in thirty trials to satisfy the “sixteen
or more” criterion.


Comment Evaluating binomial summations can be tedious, even with a calculator.
Statistical software packages offer a convenient alternative. Appendix 3.A.1
describes how one such program, Minitab, can be used to answer the sorts of
questions posed in Example 3.2.2.
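Any general-purpose language can evaluate such sums as well. The following Python sketch is offered as one possible alternative and is not the Minitab procedure of Appendix 3.A.1; the helper names `binomial_pmf` and `binomial_sum` are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_sum(lo, hi, n, p):
    """P(lo <= number of successes <= hi) for n independent trials."""
    return sum(binomial_pmf(k, n, p) for k in range(lo, hi + 1))

# Example 3.2.2(a): a 60%-effective drug shows fifteen or fewer improvements.
reject_good_drug = binomial_sum(0, 15, n=30, p=0.60)
# Example 3.2.2(b): a 50%-effective drug shows sixteen or more improvements.
accept_poor_drug = binomial_sum(16, 30, n=30, p=0.50)

print(f"(a) {reject_good_drug:.4f}")
print(f"(b) {accept_poor_drug:.2f}")
```

Changing the cutoff (here sixteen) in the two calls shows how the two error probabilities trade off against each other.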


Example 3.2.3

The Stanley Cup playoff in professional hockey is a seven-game series, where the
first team to win four games is declared the champion. The series, then, can last any- 
where from four to seven games (just like the World Series in baseball). Calculate 
the likelihoods that the series will last four, five, six, or seven games. Assume that 
(1) each game is an independent event and (2) the two teams are evenly matched. 

Consider the case where Team A wins the series in six games. For that to happen, 
they must win exactly three of the first five games and they must win the sixth game. 
Because of the independence assumption, we can write 


\[
\begin{aligned}
P(\text{Team A wins in six games}) &= P(\text{Team A wins three of first five}) \cdot P(\text{Team A wins sixth}) \\
&= \binom{5}{3}(0.5)^3(0.5)^2 \cdot (0.5) = 0.15625
\end{aligned}
\]


Since the probability that Team B wins the series in six games is the same (why?), 


\[
\begin{aligned}
P(\text{Series ends in six games}) &= P(\text{Team A wins in six games} \cup \text{Team B wins in six games}) \\
&= P(\text{A wins in six}) + P(\text{B wins in six}) \quad \text{(why?)} \\
&= 0.15625 + 0.15625 \\
&= 0.3125
\end{aligned}
\]


A similar argument allows us to calculate the probabilities of four-, five-, and
seven-game series:

\[
\begin{aligned}
P(\text{four-game series}) &= 2(0.5)^4 = 0.125 \\
P(\text{five-game series}) &= 2\binom{4}{3}(0.5)^3(0.5) \cdot (0.5) = 0.25 \\
P(\text{seven-game series}) &= 2\binom{6}{3}(0.5)^3(0.5)^3 \cdot (0.5) = 0.3125
\end{aligned}
\]


Having calculated the “theoretical” probabilities associated with the possible 
lengths of a Stanley Cup playoff raises an obvious question: How do those likeli- 
hoods compare with the actual distribution of playoff lengths? Between 1947 and 
2006 there were sixty playoffs (the 2004-05 season was cancelled). Column 2 in 
Table 3.2.3 shows the proportion of playoffs that have lasted four, five, six, and seven 
games, respectively. 


Table 3.2.3

Series Length   Observed Proportion   Theoretical Probability
4               17/60 = 0.283         0.125
5               15/60 = 0.250         0.250
6               16/60 = 0.267         0.3125
7               12/60 = 0.200         0.3125


Source: statshockey.homestead.com/stanleycup.html 


Clearly, the agreement between the entries in Columns 2 and 3 is not very good: 
Particularly noticeable is the excess of short playoffs (four games) and the deficit 
of long playoffs (seven games). What this “lack of fit” suggests is that one or more 
of the binomial distribution assumptions is not satisfied. Consider, for example, the 
parameter p, which we assumed to equal 1/2. In reality, its value might be something
quite different—just because the teams playing for the championship won their 
respective divisions, it does not necessarily follow that the two are equally good. 
Indeed, if the two contending teams were frequently mismatched, the consequence 
would be an increase in the number of short playoffs and a decrease in the num- 
ber of long playoffs. It may also be the case that momentum is a factor in a team’s 
chances of winning a given game. If so, the independence assumption implicit in the 
binomial model is rendered invalid.
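The effect of mismatched teams can be explored by letting the per-game win probability differ from 1/2 while keeping the independence assumption. The Python sketch below generalizes the argument used for the six-game case; the function name `series_length_pmf` is hypothetical, and p = 0.7 is an arbitrary value chosen only to illustrate a mismatch.

```python
from math import comb

def series_length_pmf(p):
    """P(best-of-seven series lasts g games), g = 4..7, if Team A wins each game w.p. p."""
    q = 1 - p
    pmf = {}
    for g in range(4, 8):
        # The eventual winner takes exactly three of the first g - 1 games, then game g.
        ways = comb(g - 1, 3)
        pmf[g] = ways * (p**4 * q ** (g - 4) + q**4 * p ** (g - 4))
    return pmf

print(series_length_pmf(0.5))  # evenly matched teams, as assumed in the text
print(series_length_pmf(0.7))  # hypothetical mismatch; 0.7 is an arbitrary choice
```

With p = 0.5 the function reproduces the theoretical column of Table 3.2.3. Moving p away from 0.5 shifts probability toward four-game series and away from seven-game series, the direction of the discrepancy seen in the observed proportions.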


Example 3.2.4

The junior mathematics class at Superior High School knows that the probability of
making a 600 or greater on the SAT Reasoning Test in Mathematics is 0.231, while 
the similar probability for the Critical Reading Test is 0.191. The math students issue 
a challenge to their math-averse classmates. Each group will select four students and 
have them take the respective test. The mathematics students will win the challenge 
if more of their members exceed 600 on the mathematics test than do the other 
students on the Critical Reading Test. What is the probability that the mathematics 
students win the challenge? 

Let M denote the number of mathematics scores of 600 or more and CR
denote the similar number for the critical reading testees. In this notation, a typical
combination in which the mathematics class wins is CR = 2, M = 3. The probability
of this combination is

\[
P(CR = 2, M = 3) = P(CR = 2)P(M = 3)
\]

because events involving CR and M are independent. But

\[
P(CR = 2) \cdot P(M = 3) = \left[\binom{4}{2}(0.191)^2(0.809)^2\right] \cdot \left[\binom{4}{3}(0.231)^3(0.769)^1\right] = (0.143)(0.038) = 0.0054
\]


Table 3.2.4 below lists all of these joint probabilities to four decimal places for the 
various values of CR and M. The shaded cells (those with M > CR) are those where 
mathematics wins the challenge. 

Table 3.2.4 

                        M 
CR      0         1         2         3         4 
0     0.1498    0.1800    0.0811    0.0162    0.0012 
1     0.1415    0.1700    0.0766    0.0153    0.0012 
2     0.0501    0.0602    0.0271    0.0054    0.0004 
3     0.0079    0.0095    0.0043    0.0009    0.0001 
4     0.0005    0.0006    0.0003    0.0001    0.0000 


The sum of the probabilities in the shaded cells is 0.3775. 
The moral of the story is that the mathematics students need to study more 
probability. 
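The total for the winning cells can be checked numerically. The following is a minimal sketch that sums P(CR = cr)P(M = m) over all pairs with m > cr; the helper names `binom_pmf` and `prob_math_wins` are illustrative and do not come from the text.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_math_wins(n=4, p_math=0.231, p_cr=0.191):
    """P(M > CR) for independent binomial counts M and CR, each on n students."""
    return sum(
        binom_pmf(cr, n, p_cr) * binom_pmf(m, n, p_math)
        for cr in range(n + 1)
        for m in range(cr + 1, n + 1)   # only the cells where mathematics wins
    )

print(round(prob_math_wins(), 4))
```

Because the two groups are independent, each cell is a product of two binomial probabilities, so no joint table has to be stored.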


Questions 


3.2.1. An investment analyst has tracked a certain blue- 
chip stock for the past six months and found that on any 
given day, it either goes up a point or goes down a point. 
Furthermore, it went up on 25% of the days and down 
on 75%. What is the probability that at the close of trad- 
ing four days from now, the price of the stock will be the 
same as it is today? Assume that the daily fluctuations are 
independent events. 


3.2.2. In a nuclear reactor, the fission process is con- 
trolled by inserting special rods into the radioactive core 
to absorb neutrons and slow down the nuclear chain reac- 
tion. When functioning properly, these rods serve as a 
first-line defense against a core meltdown. Suppose a reac- 
tor has ten control rods, each operating independently and 
each having an 0.80 probability of being properly inserted 
in the event of an “incident.” Furthermore, suppose that 
a meltdown will be prevented if at least half the rods 
perform satisfactorily. What is the probability that, upon 
demand, the system will fail? 


3.2.3. In 2009 a donor who insisted on anonymity 
gave seven-figure donations to twelve universities. A 
media report of this generous but somewhat mysteri- 
ous act identified that all of the universities awarded 
had female presidents. It went on to say that with about 
23% of U.S. college presidents being women, the prob- 
ability of a dozen randomly selected institutions having 
female presidents is about 1/50,000,000. Is this probability 
approximately correct? 


3.2.4. An entrepreneur owns six corporations, each with 
more than $10 million in assets. The entrepreneur consults 
the U.S. Internal Revenue Data Book and discovers that 
the IRS audits 15.3% of businesses of that size. What is 
the probability that two or more of these businesses will 
be audited? 


3.2.5. The probability is 0.10 that ball bearings in a 
machine component will fail under certain adverse condi- 
tions of load and temperature. If a component containing 
eleven ball bearings must have at least eight of them 
functioning to operate under the adverse conditions, what 
is the probability that it will break down? 


3.2.6. Suppose that since the early 1950s some 
ten-thousand independent UFO sightings have been 
reported to civil authorities. If the probability that any 
sighting is genuine is on the order of one in one hundred 
thousand, what is the probability that at least one of the 
ten-thousand was genuine? 


3.2.7. Doomsday Airlines (“Come Take the Flight of 
Your Life”) has two dilapidated airplanes, one with two 
engines, and the other with four. Each plane will land 
safely only if at least half of its engines are working. Each 
engine on each aircraft operates independently and each 
has probability p = 0.4 of failing. Assuming you wish to 
maximize your survival probability, which plane should 
you fly on? 


3.2.8. Two lighting systems are being proposed for an 
employee work area. One requires fifty bulbs, each hav- 
ing a probability of 0.05 of burning out within a month’s 
time. The second has one hundred bulbs, each with a 
0.02 burnout probability. Whichever system is installed 
will be inspected once a month for the purpose of replac- 
ing burned-out bulbs. Which system is likely to require 
less maintenance? Answer the question by comparing the 
probabilities that each will require at least one bulb to be 
replaced at the end of thirty days. 


3.2.9. The great English diarist Samuel Pepys asked his 
friend Sir Isaac Newton the following question: Is it more 
likely to get at least one 6 when six dice are rolled, at 
least two 6’s when twelve dice are rolled, or at least three 
6’s when eighteen dice are rolled? After considerable correspondence 
[see (158)], Newton convinced the skeptical 
Pepys that the first event is the most likely. Compute the 
three probabilities. 


3.2.10. The gunner on a small assault boat fires six mis- 
siles at an attacking plane. Each has a 20% chance of being 
on-target. If two or more of the shells find their mark, the 
plane will crash. At the same time, the pilot of the plane 
fires ten air-to-surface rockets, each of which has a 0.05 
chance of critically disabling the boat. Would you rather 
be on the plane or the boat? 


3.2.11. If a family has four children, is it more likely 
they will have two boys and two girls or three of one sex 
and one of the other? Assume that the probability of a 
child being a boy is 1/2 and that the births are independent 
events. 


3.2.12. Experience has shown that only 1/3 of all patients 
having a certain disease will recover if given the standard 
treatment. A new drug is to be tested on a group of twelve 
volunteers. If the FDA requires that at least seven of these 
patients recover before it will license the new drug, what is 
the probability that the treatment will be discredited even 
if it has the potential to increase an individual’s recovery 
rate to 1/2? 


3.2.13. Transportation to school for a rural county’s 
seventy-six children is provided by a fleet of four buses. 
Drivers are chosen on a day-to-day basis and come from 
a pool of local farmers who have agreed to be “on call.” 
What is the smallest number of drivers who need to be in 
the pool if the county wants to have at least a 95% proba- 
bility on any given day that all the buses will run? Assume 
that each driver has an 80% chance of being available if 
contacted. 


3.2.14. The captain of a Navy gunboat orders a vol- 
ley of twenty-five missiles to be fired at random along 
a five-hundred-foot stretch of shoreline that he hopes 
to establish as a beachhead. Dug into the beach is a 
thirty-foot-long bunker serving as the enemy’s first line 
of defense. The captain has reason to believe that the 
bunker will be destroyed if at least three of the missiles 
are on-target. What is the probability of that happening? 


3.2.15. A computer has generated seven random num- 
bers over the interval 0 to 1. Is it more likely that (a) 
exactly three will be in the interval 1/2 to 1 or (b) fewer than 
three will be greater than 3/4? 


3.2.16. Listed in the following table is the length distri- 
bution of World Series competition for the 58 series from 
1950 to 2008 (there was no series in 1994). 


World Series Lengths 


Number of Games, X    Number of Years 
[table entries illegible in the source] 


Source: espn.go.com/mlb/worldseries/history/winners 


Assuming that each World Series game is an independent 
event and that the probability of either team’s winning 
any particular contest is 0.5, find the probability of each 
series length. How well does the model fit the data? 
(Compute the “expected” frequencies, that is, multiply the 
probability of a given-length series times 58). 


3.2.17. Use the expansion of $(x + y)^n$ (recall the comment 
in Section 2.6 on p. 67) to verify that the binomial 
probabilities sum to 1; that is, $\sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k} = 1$. 


3.2.18. Suppose a series of n independent trials can 
end in one of three possible outcomes. Let $k_1$ and $k_2$ 
denote the number of trials that result in outcomes 1 
and 2, respectively. Let $p_1$ and $p_2$ denote the probabilities 
associated with outcomes 1 and 2. Generalize 
Theorem 3.2.1 to deduce a formula for the probability 
of getting $k_1$ and $k_2$ occurrences of outcomes 1 and 2, 
respectively. 


3.2.19. Repair calls for central air conditioners fall into 
three general categories: coolant leakage, compressor 
failure, and electrical malfunction. Experience has shown 
that the probabilities associated with the three are 0.5, 0.3, 
and 0.2, respectively. Suppose that a dispatcher has logged 
in ten service requests for tomorrow morning. Use the 
answer to Question 3.2.18 to calculate the probability that 
three of those ten will involve coolant leakage and five will 
be compressor failures. 


The Hypergeometric Distribution 


The second “special” distribution that we want to look at formalizes the urn prob- 
lems that frequented Chapter 2. Our solutions to those earlier problems tended to 
be enumerations in which we listed the entire set of possible samples, and then 
counted the ones that satisfied the event in question. The inefficiency and redun- 
dancy of that approach should now be painfully obvious. What we are seeking here 
is a general formula that can be applied to any and all such problems, much like the 
expression in Theorem 3.2.1 can handle the full range of questions arising from the 
binomial model. 


Suppose an urn contains r red chips and w white chips, where r + w = N. Imag- 
ine drawing n chips from the urn one at a time without replacing any of the chips 
selected. At each drawing we record the color of the chip removed. The question 
is, what is the probability that exactly k red chips are included among the n that are 
removed? 


Notice that the experiment just described is similar in some respects to the binomial 
model, but the method of sampling creates a critical distinction. If each chip 
drawn was replaced prior to making another selection, then each drawing would be 
an independent trial, the chances of drawing a red in any given trial would be a constant 
r/N, and the probability that exactly k red chips would ultimately be included 
in the n selections would be a direct application of Theorem 3.2.1: 

$$P(k \text{ reds drawn}) = \binom{n}{k} (r/N)^k (1 - r/N)^{n-k}, \quad k = 0, 1, 2, \ldots, n$$


However, if the chips drawn are not replaced, then the probability of drawing a 
red on any given attempt is not necessarily r/N: Its value would depend on the col- 
ors of the chips selected earlier. Since p = P(Red is drawn) = P(success) does not 
remain constant from drawing to drawing, the binomial model of Theorem 3.2.1 
does not apply. Instead, probabilities that arise from the “no replacement” scenario 
just described are said to follow the hypergeometric distribution. 


Theorem 3.2.2 

Suppose an urn contains r red chips and w white chips, where r + w = N. If n chips 
are drawn out at random, without replacement, and if k denotes the number of red 
chips selected, then 

$$P(k \text{ red chips are chosen}) = \frac{\binom{r}{k}\binom{w}{n-k}}{\binom{N}{n}} \qquad (3.2.1)$$

where k varies over all the integers for which $\binom{r}{k}$ and $\binom{w}{n-k}$ are defined. The probabilities 
appearing on the right-hand side of Equation 3.2.1 are known as the 
hypergeometric distribution. 


Proof Assume the chips are distinguishable. We need to count the number of elements 
making up the event of getting k red chips and n − k white chips. The number 
of ways to select the k red chips, taking into account the order in which they are chosen, is 
$_rP_k$. Similarly, the number of ordered ways to select the n − k white chips is $_wP_{n-k}$. 
These counts fix the order within each color but not how the two colors interleave. 
Each outcome is an n-long ordered sequence of red and white, and there are $\binom{n}{k}$ 
ways to choose where in the sequence the red chips go. Thus, the number of elements 
in the event of interest is $\binom{n}{k}\,{}_rP_k\,{}_wP_{n-k}$. Now, the total number of ways to choose 
n elements from N, in order, without replacement is $_NP_n$, so 

$$P(k \text{ red chips are chosen}) = \frac{\binom{n}{k}\,{}_rP_k\,{}_wP_{n-k}}{{}_NP_n}$$

This quantity, while correct, is not in the form of the statement of the theorem. 
To make that conversion, we have to change all of the terms in the expression to 
factorials: 

$$\begin{aligned}
P(k \text{ red chips are chosen}) &= \frac{\binom{n}{k}\,{}_rP_k\,{}_wP_{n-k}}{{}_NP_n} \\
&= \frac{\dfrac{n!}{k!\,(n-k)!} \cdot \dfrac{r!}{(r-k)!} \cdot \dfrac{w!}{(w-n+k)!}}{\dfrac{N!}{(N-n)!}} \\
&= \frac{\dfrac{r!}{k!\,(r-k)!} \cdot \dfrac{w!}{(n-k)!\,(w-n+k)!}}{\dfrac{N!}{n!\,(N-n)!}} = \frac{\binom{r}{k}\binom{w}{n-k}}{\binom{N}{n}}
\end{aligned}$$

Comment The appearance of binomial coefficients suggests a model of selecting 
unordered subsets. Indeed, one can consider the model of selecting a subset of size 
n simultaneously, where order doesn’t matter. In that case, the question remains: 
What is the probability of getting k red chips and n —k white chips? A moment’s 
reflection will show that the hypergeometric probabilities given in the statement 
of the theorem also answer that question. So, if our interest is simply counting the 
number of red and white chips in the sample, the probabilities are the same whether 
the drawing of the sample is simultaneous or the chips are drawn in order without 
repetition. 
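The equivalence of the ordered and unordered counts can be spot-checked numerically. The sketch below is only a sanity check for particular values, not a proof; it compares the ordered-sequence count with the binomial-coefficient form of Theorem 3.2.2. The function names and the urn sizes used are arbitrary choices made for illustration.

```python
from math import comb, perm

def hypergeom_ordered(k, r, w, n):
    """P(k red) counted over ordered draws: C(n,k) * rPk * wP(n-k) / NPn."""
    N = r + w
    return comb(n, k) * perm(r, k) * perm(w, n - k) / perm(N, n)

def hypergeom_unordered(k, r, w, n):
    """P(k red) counted over unordered subsets, as in Theorem 3.2.2."""
    N = r + w
    return comb(r, k) * comb(w, n - k) / comb(N, n)

# Illustrative urn: 5 red, 7 white, 4 chips drawn.
r, w, n = 5, 7, 4
for k in range(n + 1):
    print(k, hypergeom_ordered(k, r, w, n), hypergeom_unordered(k, r, w, n))
```

`math.perm` and `math.comb` require Python 3.8 or later; both return 0 when asked to choose more items than are available, so impossible values of k get probability 0.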


Comment The name hypergeometric derives from a series introduced by the Swiss 
mathematician and physicist Leonhard Euler, in 1769: 


$$1 + \frac{ab}{c}x + \frac{a(a+1)b(b+1)}{2!\,c(c+1)}x^2 + \frac{a(a+1)(a+2)b(b+1)(b+2)}{3!\,c(c+1)(c+2)}x^3 + \cdots$$

This is an expansion of considerable flexibility: Given appropriate values for a, b, 
and c, it reduces to many of the standard infinite series used in analysis. In particular, 
if a is set equal to 1, and b and c are set equal to each other, it reduces to the familiar 
geometric series, 

$$1 + x + x^2 + x^3 + \cdots$$

hence the name hypergeometric. The relationship of the probability function in Theorem 
3.2.2 to Euler’s series becomes apparent if we set a = −n, b = −r, c = w − n + 1, 
and multiply the series by $\binom{w}{n}/\binom{N}{n}$. Then the coefficient of $x^k$ will be 

$$\frac{\binom{r}{k}\binom{w}{n-k}}{\binom{N}{n}}$$

the value the theorem gives for P(k red chips are chosen). 


Example 3.2.5 


A hung jury is one that is unable to reach a unanimous decision. Suppose that a 
pool of twenty-five potential jurors is assigned to a murder case where the evidence 
is so overwhelming against the defendant that twenty-three of the twenty-five would 
return a guilty verdict. The other two potential jurors would vote to acquit regardless 
of the facts. What is the probability that a twelve-member panel chosen at random 
from the pool of twenty-five will be unable to reach a unanimous decision? 

Think of the jury pool as an urn containing twenty-five chips, twenty-three of 
which correspond to jurors who would vote “guilty” and two of which correspond 
to jurors who would vote “not guilty.” If either or both of the jurors who would 
vote “not guilty” are included in the panel of twelve, the result would be a hung 
jury. Applying Theorem 3.2.2 (twice) gives 0.74 as the probability that the jury 
impanelled would not reach a unanimous decision: 

$$P(\text{Hung jury}) = P(\text{Decision is not unanimous}) = \frac{\binom{2}{1}\binom{23}{11}}{\binom{25}{12}} + \frac{\binom{2}{2}\binom{23}{10}}{\binom{25}{12}} = 0.74$$


Example 3.2.6 


The Florida Lottery features a number of games of chance, one of which is called 
Fantasy Five. The player chooses five numbers from a card containing the numbers 1 
through 36. Each day five numbers are chosen at random, and if the player matches 
all five, the winnings can be as much as $200,000 for a $1 bet. 

Lottery games like this one have spawned a mini-industry looking for biases 
in the selection of the winning numbers. Websites post various “analyses” claiming 
certain numbers are “hot” and should be played. One such examination focused 
on the frequency of winning numbers between 1 and 12. The probability of such 
occurrences fits the hypergeometric distribution, where r = 12, w = 24, n =5, and 
N = 36. For example, the probability that three of the five numbers are 12 or less is 


$$\frac{\binom{12}{3}\binom{24}{2}}{\binom{36}{5}} = \frac{60{,}720}{376{,}992} = 0.161$$


Notice how that compares to the observed proportion of drawings with exactly three 
numbers between 1 and 12. Of the 2008 daily drawings—366 of them—there were 
sixty-five with three numbers 12 or less, giving a relative frequency of 65/366 =0.178. 

The full breakdown of observed and expected probabilities for winning num- 
bers between 1 and 12 is given in Table 3.2.5. 

Table 3.2.5 

No. Drawn ≤ 12    Observed Proportion    Hypergeometric Probability 
0                 0.128                  0.113 
1                 0.372                  0.338 
2                 0.279                  0.354 
3                 0.178                  0.161 
4                 0.038                  0.032 
5                 0.005                  0.002 

Source: www.flalottery.com/exptkt/ff.html 


The naive or dishonest commentator might claim that the lottery “likes” numbers 
≤ 12 since the proportion of tickets drawn with three, four, or five numbers ≤ 12 is 

0.178 + 0.038 + 0.005 = 0.221 


This figure is in excess of the sum of the hypergeometric probabilities for k = 3, 
4, and 5: 


0.161 + 0.032 + 0.002 = 0.195 


However, we shall see in Chapter 10 that such variation is well within the random 
fluctuations expected for truly random drawings. No bias can be inferred from these 
results. 
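The hypergeometric column of Table 3.2.5 follows directly from Theorem 3.2.2 with r = 12, w = 24, and n = 5. A minimal sketch of that calculation is given here; the function name `hypergeom_pmf` is an illustrative choice, not notation from the text.

```python
from math import comb

def hypergeom_pmf(k, r, w, n):
    """P(exactly k red chips) when n chips are drawn without replacement
    from an urn holding r red and w white chips (Theorem 3.2.2)."""
    return comb(r, k) * comb(w, n - k) / comb(r + w, n)

# Fantasy Five: 12 "red" numbers (1 through 12), 24 "white" numbers, 5 drawn.
r, w, n = 12, 24, 5
probs = [hypergeom_pmf(k, r, w, n) for k in range(n + 1)]
for k, p in enumerate(probs):
    print(k, round(p, 3))
print("P(three or more numbers <= 12):", round(sum(probs[3:]), 3))
```

The last line reproduces the model's probability of three or more low numbers, the quantity compared with the observed 0.221.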


When a bullet is fired it becomes scored with minute striations produced by imper- 
fections in the gun barrel. Appearing as a series of parallel lines, these striations 
have long been recognized as a basis for matching a bullet with a gun, since repeated 
firings of the same weapon will produce bullets having substantially the same con- 
figuration of markings. Until recently, deciding how close two patterns had to be 
before it could be concluded the bullets came from the same weapon was largely 
subjective. A ballistics expert would simply look at the two bullets under a micro- 
scope and make an informed judgment based on past experience. Today, however, 
criminologists are beginning to address the problem more quantitatively, partly with 
the help of the hypergeometric distribution. 

Suppose a bullet is recovered from the scene of a crime, along with the suspect’s 
gun. Under a microscope, a grid of m cells, numbered 1 to m, is superimposed 
over the bullet. If m is chosen large enough that the width of the cells is sufficiently 
small, each of that evidence bullet’s $n_e$ striations will fall into a different cell (see 
Figure 3.2.1a). Then the suspect’s gun is fired, yielding a test bullet, which will have 
a total of $n_t$ striations located in a possibly different set of cells (see Figure 3.2.1b). 
How might we assess the similarities in cell locations for the two striation patterns? 

As a model for the striation pattern on the evidence bullet, imagine an urn containing 
m chips, with $n_e$ corresponding to the striation locations. Now, think of the 
striation pattern on the test bullet as representing a sample of size $n_t$ from the evidence 
urn. By Theorem 3.2.2, the probability that k of the cell locations will be 
shared by the two striation patterns is 

$$\frac{\binom{n_e}{k}\binom{m - n_e}{n_t - k}}{\binom{m}{n_t}}$$

Suppose the bullet found at a murder scene is superimposed with a grid having 
m = 25 cells, $n_e = 4$ of which contain striations. The suspect’s gun is fired and the bullet 
is found to have $n_t = 3$ striations, one of which matches the location of one of 
the striations on the evidence bullet. What do you think a ballistics expert would 
conclude? 


Figure 3.2.1 (a) Evidence bullet, with a total of $n_e$ striations in cells 1, 2, ..., m. (b) Test bullet, with a total of $n_t$ striations in cells 1, 2, ..., m. 


Intuitively, the similarity between the two bullets would be reflected in the probability 
that one or more striations in the suspect’s bullet match the evidence bullet. 
The smaller that probability is, the stronger would be our belief that the two bullets 
were fired by the same gun. Based on the values given for m, $n_e$, and $n_t$, 

$$P(\text{one or more matches}) = \frac{\binom{4}{1}\binom{21}{2}}{\binom{25}{3}} + \frac{\binom{4}{2}\binom{21}{1}}{\binom{25}{3}} + \frac{\binom{4}{3}\binom{21}{0}}{\binom{25}{3}} = 0.42$$

If P(one or more matches) had been a very small number (say, 0.001), the 
inference would have been clear-cut: The same gun fired both bullets. But, here 
with the probability of one or more matches being so large, we cannot rule out the 
possibility that the bullets were fired by two different guns (and, presumably, by two 
different people). 


A tax collector, finding himself short of funds, delayed depositing a large property 
tax payment ten different times. The money was subsequently repaid, and the whole 
amount deposited in the proper account. The tip-off to this behavior was the delay 
of the deposit. During the period of these irregularities, there was a total of 470 tax 
collections. 

An auditing firm was preparing to do a routine annual audit of these transactions. 
They decided to randomly sample nineteen of the collections (approximately 
4% of the payments). The auditors would assume a pattern of malfeasance only if 
they saw three or more irregularities. What is the probability that three or more of 
the delayed deposits would be chosen in this sample? 

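This kind of audit sampling can be considered a hypergeometric experiment. 
Here, N = 470, n = 19, r = 10, and w = 460. In this case it is better to calculate the 
desired probability via the complement, that is, 

$$1 - \frac{\binom{10}{0}\binom{460}{19}}{\binom{470}{19}} - \frac{\binom{10}{1}\binom{460}{18}}{\binom{470}{19}} - \frac{\binom{10}{2}\binom{460}{17}}{\binom{470}{19}}$$

The calculation of the first hypergeometric term is 

$$\frac{\binom{10}{0}\binom{460}{19}}{\binom{470}{19}} = 1 \cdot \frac{460!}{19!\,441!} \cdot \frac{19!\,451!}{470!} = \frac{451}{470} \cdot \frac{450}{469} \cdots \frac{442}{461} = 0.6592$$

To compute hypergeometric probabilities where the numbers are large, a useful 
device is a recursion formula. To that end, note that the ratio of the k + 1 term to the 
k term is 

$$\frac{\binom{r}{k+1}\binom{w}{n-k-1}}{\binom{N}{n}} \div \frac{\binom{r}{k}\binom{w}{n-k}}{\binom{N}{n}} = \frac{n-k}{k+1} \cdot \frac{r-k}{w-n+k+1}$$

(See Question 3.2.30.) 
Therefore, 

$$\frac{\binom{10}{1}\binom{460}{18}}{\binom{470}{19}} = 0.6592 \cdot \frac{19-0}{1+0} \cdot \frac{10-0}{460-19+0+1} = 0.2834$$

and 

$$\frac{\binom{10}{2}\binom{460}{17}}{\binom{470}{19}} = 0.2834 \cdot \frac{19-1}{1+1} \cdot \frac{10-1}{460-19+1+1} = 0.0518$$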

The desired probability, then, is 1 − 0.6592 − 0.2834 − 0.0518 = 0.0056, which shows 
that a larger audit sample would be necessary to have a reasonable chance of 
detecting this sort of impropriety. 
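The recursion used in this example is easy to mechanize. The sketch below computes P(0) as a product of ratios (an equivalent rearrangement of the factorial expression, valid when w ≥ n) and then applies the ratio formula repeatedly; the function name `hypergeom_terms` and its argument names are assumptions made for illustration.

```python
def hypergeom_terms(r, w, n, k_max):
    """Return [P(0), P(1), ..., P(k_max)] for the hypergeometric distribution,
    using P(k+1) = P(k) * (n-k)/(k+1) * (r-k)/(w-n+k+1)."""
    N = r + w
    # P(0) = C(w, n) / C(N, n), written as a product to avoid huge factorials.
    p = 1.0
    for i in range(n):
        p *= (w - i) / (N - i)
    terms = [p]
    for k in range(k_max):
        p *= (n - k) / (k + 1) * (r - k) / (w - n + k + 1)
        terms.append(p)
    return terms

# Audit example: r = 10 delayed deposits, w = 460 others, sample of n = 19.
terms = hypergeom_terms(r=10, w=460, n=19, k_max=2)
p_three_or_more = 1 - sum(terms)
print([round(t, 4) for t in terms], round(p_three_or_more, 4))
```

Working with ratios keeps every intermediate value between 0 and 1, which is the practical advantage of the recursion when N is large.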


Case Study 3.2.1 


Biting into a plump, juicy apple is one of the innocent pleasures of autumn. 
Critical to that enjoyment is the firmness of the apple, a property that growers 
and shippers monitor closely. The apple industry goes so far as to set a lowest 
acceptable limit for firmness, which is measured (in lbs) by inserting a probe 
into the apple. For the Red Delicious variety, for example, firmness is supposed 
to be at least 12 lbs; in the state of Washington, wholesalers are not allowed to 
sell apples if more than 10% of their shipment falls below that 12-lb limit. 

All of this raises an obvious question: How can shippers demonstrate that 
their apples meet the 10% standard? Testing each one is not an option— 
the probe that measures firmness renders an apple unfit for sale. That leaves 
sampling as the only viable strategy. 

Suppose, for example, a shipper has a supply of 144 apples. She decides 
to select 15 at random and measure each one’s firmness, with the intention of 
selling the remaining apples if 2 or fewer in the sample are substandard. What 
are the consequences of her plan? More specifically, does it have a good chance 
of “accepting” a shipment that meets the 10% rule and “rejecting” one that does 
not? (If either or both of those objectives are not met, the plan is inappropriate.) 

For example, suppose there are actually 10 defective apples among the 
original 144. Since 10/144 × 100 = 6.9%, that shipment would be suitable for sale 
because fewer than 10% failed to meet the firmness standard. The question is, 
how likely is it that a sample of 15 chosen at random from that shipment will 
pass inspection? 
pass inspection? 

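Notice, here, that the number of substandard apples in the sample has a 
hypergeometric distribution with r = 10, w = 134, n = 15, and N = 144. Therefore, 

$$\begin{aligned}
P(\text{Sample passes inspection}) &= P(\text{2 or fewer substandard apples are found}) \\
&= \frac{\binom{10}{0}\binom{134}{15}}{\binom{144}{15}} + \frac{\binom{10}{1}\binom{134}{14}}{\binom{144}{15}} + \frac{\binom{10}{2}\binom{134}{13}}{\binom{144}{15}} \\
&= 0.320 + 0.401 + 0.208 = 0.929
\end{aligned}$$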


So, the probability is reassuringly high that a supply of apples this good would, 
in fact, be judged acceptable to ship. Of course, it also follows from this cal- 
culation that roughly 7% of the time, the number of substandard apples found 
will be greater than 2, in which case the apples would be (incorrectly) assumed 
to be unsuitable for sale (earning them an undeserved one-way ticket to the 
applesauce factory ... ). 

How good is the proposed sampling plan at recognizing apples that would, 
in fact, be inappropriate to ship? Suppose, for example, that 30, or 21%, of the 
144 apples would fall below the 12-lb limit. Ideally, the probability here that a 
sample passes inspection should be small. The number of substandard apples 
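found in this case would be hypergeometric with r = 30, w = 114, n = 15, and 
N = 144, so 

$$\begin{aligned}
P(\text{Sample passes inspection}) &= \frac{\binom{30}{0}\binom{114}{15}}{\binom{144}{15}} + \frac{\binom{30}{1}\binom{114}{14}}{\binom{144}{15}} + \frac{\binom{30}{2}\binom{114}{13}}{\binom{144}{15}} \\
&= 0.024 + 0.110 + 0.221 = 0.355
\end{aligned}$$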


Here the bad news is that the sampling plan will allow a 21% defective supply 
to be shipped 36% of the time. The good news is that 64% of the time, the 
number of substandard apples in the sample will exceed 2, meaning that the 
correct decision “not to ship” will be made. 

Figure 3.2.2 shows P(Sample passes) plotted against the percentage of 
defectives in the entire supply. Graphs of this sort are called operating characteristic 
(or OC) curves: They summarize how a sampling plan will respond to 
all possible levels of quality. 


Figure 3.2.2 [OC curve: P(Sample passes), on a vertical axis from 0 to 1, plotted against presumed percent defective] 


Comment Every sampling plan invariably allows for two kinds of errors: 
rejecting shipments that should be accepted and accepting shipments that 
should be rejected. In practice, the probabilities of committing these errors can 
be manipulated by redefining the decision rule and/or changing the sample size. 
Some of these options will be explored in Chapter 6. 
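The points on an operating characteristic curve for the apple-sampling plan can be tabulated directly from Theorem 3.2.2. The following is a minimal sketch, assuming the plan of Case Study 3.2.1 (lot of 144, sample of 15, accept if 2 or fewer are substandard); the function name `pass_probability` and its keyword arguments are illustrative, and the particular defect counts printed are arbitrary.

```python
from math import comb

def pass_probability(defectives, lot_size=144, sample_size=15, max_allowed=2):
    """P(sample contains at most max_allowed substandard items) when the lot
    holds `defectives` substandard items and sampling is without replacement."""
    good = lot_size - defectives
    total = comb(lot_size, sample_size)
    favorable = sum(
        comb(defectives, k) * comb(good, sample_size - k)
        for k in range(max_allowed + 1)
    )
    return favorable / total

# Tabulate a few points of the OC curve.
for d in (0, 10, 14, 30, 50):
    pct = 100 * d / 144
    print(f"{pct:5.1f}% defective -> P(pass) = {pass_probability(d):.3f}")
```

Changing `sample_size` or `max_allowed` shows how the decision rule trades off the two kinds of error mentioned in the Comment.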


Questions 


3.2.20. A corporate board contains twelve members. 
The board decides to create a five-person Committee 
to Hide Corporation Debt. Suppose four members of 
the board are accountants. What is the probability that 
the Committee will contain two accountants and three 
nonaccountants? 


3.2.21. One of the popular tourist attractions in Alaska 
is watching black bears catch salmon swimming upstream 
to spawn. Not all “black” bears are black, though— 
some are tan-colored. Suppose that six black bears and 
three tan-colored bears are working the rapids of a 
salmon stream. Over the course of an hour, six different 
bears are sighted. What is the probability that those six 
include at least twice as many black bears as tan-colored 
bears? 


3.2.22. A city has 4050 children under the age of ten, 
including 514 who have not been vaccinated for measles. 
Sixty-five of the city’s children are enrolled in the ABC 
Day Care Center. Suppose the municipal health depart- 
ment sends a doctor and a nurse to ABC to immunize any 
child who has not already been vaccinated. Find a formula 
for the probability that exactly k of the children at ABC 
have not been vaccinated. 


3.2.23. Country A inadvertently launches ten 
guided missiles—six armed with nuclear warheads—at 
Country B. In response, Country B fires seven antiballis- 
tic missiles, each of which will destroy exactly one of the 
incoming rockets. The antiballistic missiles have no way 
of detecting, though, which of the ten rockets are carrying 
nuclear warheads. What are the chances that Country B 
will be hit by at least one nuclear missile? 


3.2.24. Anne is studying for a history exam covering the 
French Revolution that will consist of five essay ques- 
tions selected at random from a list of ten the professor 
has handed out to the class in advance. Not exactly a 
Napoleon buff, Anne would like to avoid researching all 
ten questions but still be reasonably assured of getting a 
fairly good grade. Specifically, she wants to have at least 
an 85% chance of getting at least four of the five ques- 
tions right. Will it be sufficient if she studies eight of the 
ten questions? 


3.2.25. Each year a college awards five merit-based schol- 
arships to members of the entering freshman class who 
have exceptional high school records. The initial pool 
of applicants for the upcoming academic year has been 
reduced to a “short list” of eight men and ten women, all 
of whom seem equally deserving. If the awards are made 
at random from among the eighteen finalists, what are the 
chances that both men and women will be represented? 


3.2.26. Keno is a casino game in which the player has 
a card with the numbers 1 through 80 on it. The player 
selects a set of k numbers from the card, where k can 
range from one to fifteen. The “caller” announces twenty 
winning numbers, chosen at random from the eighty. The 
amount won depends on how many of the called numbers 
match those the player chose. Suppose the player picks ten 
numbers. What is the probability that among those ten are 
six winning numbers? 


3.2.27. A display case contains thirty-five gems, of which 
ten are real diamonds and twenty-five are fake diamonds. 
A burglar removes four gems at random, one at a time 
and without replacement. What is the probability that the 
last gem she steals is the second real diamond in the set of 
four? 


3.2.28. A bleary-eyed student awakens one morning, late 
for an 8:00 class, and pulls two socks out of a drawer that 
contains two black, six brown, and two blue socks, all ran- 
domly arranged. What is the probability that the two he 
draws are a matched pair? 


3.2.29. Show directly that the set of probabilities associated 
with the hypergeometric distribution sum to 1. (Hint: 
Expand the identity 

$$(1 + \mu)^N = (1 + \mu)^r (1 + \mu)^{N-r}$$

and equate coefficients.) 


3.2.30. Show that the ratio of two successive hypergeometric 
probability terms satisfies the following equation, 

$$\frac{\binom{r}{k+1}\binom{w}{n-k-1}}{\binom{N}{n}} \div \frac{\binom{r}{k}\binom{w}{n-k}}{\binom{N}{n}} = \frac{n-k}{k+1} \cdot \frac{r-k}{w-n+k+1}$$

for any k where both numerators are defined. 


3.2.31. Urn I contains five red chips and four white chips; 
urn II contains four red and five white chips. Two chips are 
drawn simultaneously from urn I and placed into urn II. 
Then a single chip is drawn from urn II. What is the prob- 
ability that the chip drawn from urn II is white? (Hint: Use 
Theorem 2.4.1.) 


3.2.32. As the owner of a chain of sporting goods stores, 
you have just been offered a “deal” on a shipment of 
one hundred robot table tennis machines. The price is 
right, but the prospect of picking up the merchandise at 
midnight from an unmarked van parked on the side of 
the New Jersey Turnpike is a bit disconcerting. Being of 
low repute yourself, you do not consider the legality of 
the transaction to be an issue, but you do have concerns 
about being cheated. If too many of the machines are 
in poor working order, the offer ceases to be a bargain. 
Suppose you decide to close the deal only if a sample of 
ten machines contains no more than one defective. Con- 
struct the corresponding operating characteristic curve. 
For approximately what incoming quality will you accept 
a shipment 50% of the time? 


3.2.33. Suppose that r of N chips are red. Divide the chips 
into three groups of sizes $n_1$, $n_2$, and $n_3$, where $n_1 + n_2 + n_3 = N$. 
Generalize the hypergeometric distribution to find 
the probability that the first group contains $r_1$ red chips, 
the second group $r_2$ red chips, and the third group $r_3$ red 
chips, where $r_1 + r_2 + r_3 = r$. 


3.2.34. Some nomadic tribes, when faced with a life- 
threatening, contagious disease, try to improve their 
chances of survival by dispersing into smaller groups. Sup- 
pose a tribe of twenty-one people, of whom four are 
carriers of the disease, split into three groups of seven 
each. What is the probability that at least one group is 
free of the disease? (Hint: Find the probability of the 
complement.) 


3.2.35. Suppose a population contains $n_1$ objects of one 
kind, $n_2$ objects of a second kind, ..., and $n_t$ objects of a 
tth kind, where $n_1 + n_2 + \cdots + n_t = N$. A sample of size n 
is drawn at random and without replacement. Deduce an 
expression for the probability of drawing $k_1$ objects of the 
first kind, $k_2$ objects of the second kind, ..., and $k_t$ objects 
of the tth kind by generalizing Theorem 3.2.2. 


3.2.36. Sixteen students—five freshmen, four sopho- 
mores, four juniors, and three seniors—have applied for 
membership in their school’s Communications Board, 
a group that oversees the college’s newspaper, literary 
magazine, and radio show. Eight positions are open. If 
the selection is done at random, what is the probability 
that each class gets two representatives? (Hint: Use the 
generalized hypergeometric model asked for in Ques- 
tion 3.2.35.) 


3.3 Discrete Random Variables 


The binomial and hypergeometric distributions described in Section 3.2 are special 
cases of some important general concepts that we want to explore more fully in this 
section. Previously in Chapter 2, we studied in depth the situation where every point 
in a sample space is equally likely to occur (recall Section 2.6). The sample space of 
independent trials that ultimately led to the binomial distribution presented a quite 
different scenario: specifically, individual points in S had different probabilities. For 
example, if $n = 4$ and $p = \frac{1}{3}$, the probabilities assigned to the sample points (s, f, s, f) 
and (f, f, f, f) are $(1/3)^2(2/3)^2 = \frac{4}{81}$ and $(2/3)^4 = \frac{16}{81}$, respectively. Allowing for the 
possibility that different outcomes may have different probabilities will obviously 
broaden enormously the range of real-world problems that probability models can 
address. 

How to assign probabilities to outcomes that are not binomial or hypergeomet- 
ric is one of the major questions investigated in this chapter. A second critical issue 
is the nature of the sample space itself and whether it makes sense to redefine the 
outcomes and create, in effect, an alternative sample space. Why we would want to 
do that has already come up in our discussion of independent trials. The “original” 
sample space in such cases is a set of ordered sequences, where the ith member of a 
sequence is either an “s” or an “ f,” depending on whether the ith trial ended in suc- 
cess or failure, respectively. However, knowing which particular trials ended in suc- 
cess is typically less important than knowing the number that did (recall the medical 
researcher discussion on p. 102). That being the case, it often makes sense to replace 
each ordered sequence with the number of successes that sequence contains. Doing 
so collapses the original set of $2^n$ ordered sequences (i.e., outcomes) in S to the set 
of n + 1 integers ranging from 0 to n. The probabilities assigned to those integers, of 
course, are given by the binomial formula in Theorem 3.2.1. 
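
As an illustrative sketch (not part of the original text), the collapse from ordered sequences to success counts can be checked by brute-force enumeration. The values n = 4 and p = 1/3 are the ones used earlier in this section; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

n, p = 4, Fraction(1, 3)

# Probability of each ordered sequence of successes ("s") and failures ("f").
seq_prob = {
    seq: p ** seq.count("s") * (1 - p) ** seq.count("f")
    for seq in product("sf", repeat=n)
}

# Collapse the 2**n sequences onto the n + 1 possible success counts.
count_prob = {k: Fraction(0) for k in range(n + 1)}
for seq, prob in seq_prob.items():
    count_prob[seq.count("s")] += prob

# Binomial formula (Theorem 3.2.1) for comparison.
binomial = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

for k in range(n + 1):
    print(k, count_prob[k], binomial[k])
```

Exact fractions are used so that the grouped probabilities and the binomial formula can be compared without rounding error.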

In general, a function that assigns numbers to outcomes is called a random vari- 
able. The purpose of such functions in practice is to define a new sample space 
whose outcomes speak more directly to the objectives of the experiment. That 
was the rationale that ultimately motivated both the binomial and hypergeometric 
distributions. 

The purpose of this section is to (1) outline the general conditions under which 
probabilities can be assigned to sample spaces and (2) explore the ways and means 
of redefining sample spaces through the use of random variables. The notation 
introduced in this section is especially important and will be used throughout the 
remainder of the book. 


Assigning Probabilities: The Discrete Case 


We begin with the general problem of assigning probabilities to sample outcomes, 
the simplest version of which occurs when the number of points in S is either finite 
or countably infinite. The probability functions, p(s), that we are looking for in those 
cases satisfy the conditions in Definition 3.3.1. 


Definition 3.3.1. Suppose that S is a finite or countably infinite sample space. 
Let p be a real-valued function defined for each element of S such that 


a. $0 \le p(s)$ for each $s \in S$ 


b. $\sum_{s \in S} p(s) = 1$ 


Then p is said to be a discrete probability function. 


Comment Once p(s) is defined for all s, it follows that the probability of any event 
A—that is, P(A) —is the sum of the probabilities of the outcomes comprising A: 


\[ P(A) = \sum_{s \in A} p(s) \tag{3.3.1} \]


Defined in this way, the function P(A) satisfies the probability axioms given in 
Section 2.3. The next several examples illustrate some of the specific forms that p(s) 
can have and how P(A) is calculated. 


Ace-six flats are a type of crooked dice where the cube is foreshortened in the one- 
six direction, the effect being that 1’s and 6’s are more likely to occur than any of 
the other four faces. Let p(s) denote the probability that the face showing is s. For 
many ace-six flats, the “cube” is asymmetric to the extent that $p(1) = p(6) = \frac{1}{4}$ while 
$p(2) = p(3) = p(4) = p(5) = \frac{1}{8}$. Notice that p(s) here qualifies as a discrete probability 
function because each p(s) is greater than or equal to 0 and the sum of p(s), over 
all s, is 1 $[= 2(\frac{1}{4}) + 4(\frac{1}{8})]$. 


Suppose A is the event that an even number occurs. It follows from Equation 3.3.1 
that $P(A) = p(2) + p(4) + p(6) = \frac{1}{8} + \frac{1}{8} + \frac{1}{4} = \frac{1}{2}$. 


Comment If two ace-six flats are rolled, the probability of getting a sum equal to 7 
is equal to $2p(1)p(6) + 2p(2)p(5) + 2p(3)p(4) = 2(\frac{1}{4})^2 + 4(\frac{1}{8})^2 = \frac{3}{16}$. If two fair dice 
are rolled, the probability of getting a sum equal to 7 is $2p(1)p(6) + 2p(2)p(5) + 
2p(3)p(4) = 6(\frac{1}{6})^2 = \frac{1}{6}$, which is less than $\frac{3}{16}$. Gamblers cheat with ace-six flats by 
switching back and forth between fair dice and ace-six flats, depending on whether 
or not they want a sum of 7 to be rolled. 
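
The arithmetic for ace-six flats can be checked with a short sketch. It assumes the face probabilities p(1) = p(6) = 1/4 and p(2) = p(3) = p(4) = p(5) = 1/8 discussed above; the function name `prob_sum_seven` is hypothetical.

```python
from fractions import Fraction

def prob_sum_seven(p):
    """P(sum = 7) for two independent dice that share the face probabilities p."""
    return sum(p[i] * p[7 - i] for i in range(1, 7))

# Assumed face probabilities for an ace-six flat and for a fair die.
flat = {1: Fraction(1, 4), 6: Fraction(1, 4),
        2: Fraction(1, 8), 3: Fraction(1, 8),
        4: Fraction(1, 8), 5: Fraction(1, 8)}
fair = {face: Fraction(1, 6) for face in range(1, 7)}

p_even = flat[2] + flat[4] + flat[6]   # event A: an even face shows

print(sum(flat.values()), p_even)
print(prob_sum_seven(flat), prob_sum_seven(fair))
```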


Example 3.3.2 


Suppose a fair coin is tossed until a head comes up for the first time. What are the 
chances of that happening on an odd-numbered toss? 

Note that the sample space here is countably infinite and so is the set of out- 
comes making up the event whose probability we are trying to find. The P(A) that 
we are looking for, then, will be the sum of an infinite number of terms. 

Let p(s) be the probability that the first head appears on the sth toss. Since 
the coin is presumed to be fair, $p(1) = \frac{1}{2}$. Furthermore, we would expect that half 
the time, when a tail appears, the next toss would be a head, so $p(2) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$. In 
general, $p(s) = (\frac{1}{2})^s$, $s = 1, 2, \ldots$. 

Does p(s) satisfy the conditions stated in Definition 3.3.1? Yes. Clearly, $p(s) \ge 0$ 
for all s. To see that the sum of the probabilities is 1, recall the formula for the sum 
of a geometric series: If $0 < r < 1$, 


\[ \sum_{s=0}^{\infty} r^s = \frac{1}{1 - r} \tag{3.3.2} \]


Applying Equation 3.3.2 to the sample space here confirms that P(S) = 1: 

\[ P(S) = \sum_{s=1}^{\infty} \left(\frac{1}{2}\right)^s = \sum_{s=0}^{\infty} \left(\frac{1}{2}\right)^s - \left(\frac{1}{2}\right)^0 = \frac{1}{1 - \frac{1}{2}} - 1 = 1 \]


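Now, let A be the event that the first head appears on an odd-numbered toss. 
Then $P(A) = p(1) + p(3) + p(5) + \cdots$. But 


\[
\begin{aligned}
p(1) + p(3) + p(5) + \cdots &= \sum_{s=0}^{\infty} p(2s + 1) = \sum_{s=0}^{\infty} \left(\frac{1}{2}\right)^{2s+1} = \frac{1}{2} \sum_{s=0}^{\infty} \left(\frac{1}{4}\right)^s \\
&= \frac{1}{2} \left[ \frac{1}{1 - \frac{1}{4}} \right] = \frac{2}{3}
\end{aligned}
\]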
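
A numerical sketch of the two infinite sums may help: partial sums of p(s) approach 1, and partial sums over the odd-numbered tosses approach 2/3. The truncation points (40 terms and 20 terms) are arbitrary choices made for illustration.

```python
from fractions import Fraction

def p(s):
    """P(first head appears on toss s) for a fair coin."""
    return Fraction(1, 2) ** s

def partial_odd(n_terms):
    """p(1) + p(3) + ... summed over the first n_terms odd-numbered tosses."""
    return sum(p(2 * s + 1) for s in range(n_terms))

total = sum(p(s) for s in range(1, 41))   # partial sum approximating P(S)
odd = partial_odd(20)                     # partial sum approximating P(A)

print(float(total), float(odd))
```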


Case Study 3.3.1 


For good pedagogical reasons, the principles of probability are always intro- 
duced by considering events defined on familiar sample spaces generated by 
simple experiments. To that end, we toss coins, deal cards, roll dice, and draw 
chips from urns. It would be a serious error, though, to infer that the importance 
of probability extends no further than the nearest casino. In its infancy, 
gambling and probability were, indeed, intimately related: Questions arising 
from games of chance were often the catalyst that motivated mathematicians 
to study random phenomena in earnest. But more than 340 years have passed 
since Huygens published De Ratiociniis. Today, the application of probability 
to gambling is relatively insignificant (the NCAA March basketball tournament 
notwithstanding) compared to the depth and breadth of uses the subject finds 
in business, medicine, engineering, and science. 

Probability functions— properly chosen—can “model” complex real-world 
phenomena every bit as well as $P(\text{heads}) = \frac{1}{2}$ describes the behavior of a fair 
coin. The following set of actuarial data is a case in point. Over a period of 
three years (= 1096 days) in London, records showed that a total of 903 deaths 
occurred among males eighty-five years of age and older (180). Columns 1 and 
2 of Table 3.3.1 give the breakdown of those 903 deaths according to the num- 
ber occurring on a given day. Column 3 gives the proportion of days for which 
exactly s elderly men died. 


Table 3.3.1 

(1)          (2)          (3)                   (4) 
Number of    Number of    Proportion            p(s) 
Deaths, s    Days         [= Col. (2)/1096] 
0            484          0.442                 0.440 
1            391          0.357                 0.361 
2            164          0.150                 0.148 
3            45           0.041                 0.040 
4            11           0.010                 0.008 
5            1            0.001                 0.003 
6+           0            0.000                 0.000 
             1096         1                     1 


For reasons that we will go into at length in Chapter 4, the probability 
function that describes the behavior of this particular phenomenon is 


\[ p(s) = P(s \text{ elderly men die on a given day}) = \frac{e^{-0.82}(0.82)^s}{s!}, \quad s = 0, 1, 2, \ldots \tag{3.3.3} \]


How do we know that the p(s) in Equation 3.3.3 is an appropriate way to assign 
probabilities to the “experiment” of elderly men dying? Because it accurately 
predicts what happened. Column 4 of Table 3.3.1 shows p(s) evaluated for s = 
0, 1,2,.... To two decimal places, the agreement between the entries in Column 
3 and Column 4 is perfect. 
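
The comparison between observed proportions and Equation 3.3.3 can be reproduced with a few lines of code. This is only a sketch: the day counts are copied from Column 2 of Table 3.3.1, and the names `days` and `rows` are hypothetical.

```python
from math import exp, factorial

days = [484, 391, 164, 45, 11, 1]   # days with s = 0, 1, ..., 5 deaths (Table 3.3.1)
total_days = 1096

def p(s, lam=0.82):
    """Equation 3.3.3: e^(-0.82) * 0.82^s / s!"""
    return exp(-lam) * lam**s / factorial(s)

# One row per value of s: (s, observed proportion, model probability).
rows = [(s, d / total_days, p(s)) for s, d in enumerate(days)]
for s, observed, model in rows:
    print(f"{s}  {observed:.3f}  {model:.3f}")
```

Note that 903 deaths over 1096 days is an average of about 0.82 deaths per day, which is the constant appearing in Equation 3.3.3.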


Consider the following experiment: Every day for the next month you copy down 
each number that appears in the stories on the front pages of your hometown news- 
paper. Those numbers would necessarily be extremely diverse: One might be the 
age of a celebrity who had just died, another might report the interest rate currently 
paid on government Treasury bills, and still another might give the number of square 
feet of retail space recently added to a local shopping mall. 

Suppose you then calculated the proportion of those numbers whose leading 
digit was a 1, the proportion whose leading digit was a 2, and so on. What relation- 
ship would you expect those proportions to have? Would numbers starting with a 2, 
for example, occur as often as numbers starting with a 6? 

Let p(s) denote the probability that the first significant digit of a “newspaper 
number” is s, s = 1, 2, ..., 9. Our intuition is likely to tell us that the nine first digits 
should be equally probable, that is, $p(1) = p(2) = \cdots = p(9) = \frac{1}{9}$. Given the diversity 
and the randomness of the numbers, there is no obvious reason why one digit should 
be more common than another. Our intuition, though, would be wrong —first digits 
are not equally likely. Indeed, they are not even close to being equally likely! 

Credit for making this remarkable discovery goes to Simon Newcomb, a math- 
ematician who observed more than a hundred years ago that some portions of 
logarithm tables are used more than others (78). Specifically, pages at the begin- 
ning of such tables are more dog-eared than pages at the end, suggesting that users 
have more occasion to look up logs of numbers starting with small digits than they 
do numbers starting with large digits. 

Almost fifty years later, a physicist, Frank Benford, reexamined Newcomb’s 
claim in more detail and looked for a mathematical explanation. What is now known 
as Benford’s law asserts that the first digits of many different types of measurements, 
or combinations of measurements, often follow the discrete probability model: 


\[ p(s) = P(\text{1st significant digit is } s) = \log\left(1 + \frac{1}{s}\right), \quad s = 1, 2, \ldots, 9 \]

where log denotes the base-10 logarithm. 


Table 3.3.2 compares Benford’s law to the uniform assumption that $p(s) = \frac{1}{9}$, for 
all s. The differences are striking. According to Benford’s law, for example, 1’s are 
the most frequently occurring first digit, appearing 6.5 times (= 0.301/0.046) as often 
as 9’s. 

Table 3.3.2 

s “Uniform” Law Benford’s Law 
1 0.111 0.301 
2 0.111 0.176 
3 0.111 0.125 
4 0.111 0.097 
5 0.111 0.079 
6 0.111 0.067 
7 0.111 0.058 
8 0.111 0.051 
9 0.111 0.046 
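
The Benford column of Table 3.3.2 can be regenerated directly from the formula. The sketch below assumes the logarithm is base 10, which is what the tabulated values imply; the dictionary names are hypothetical.

```python
from math import log10

# Benford's law and the uniform alternative for first digits 1 through 9.
benford = {s: log10(1 + 1 / s) for s in range(1, 10)}
uniform = {s: 1 / 9 for s in range(1, 10)}

for s in range(1, 10):
    print(f"{s}  {uniform[s]:.3f}  {benford[s]:.3f}")

# How much more often a leading 1 occurs than a leading 9.
ratio_1_to_9 = benford[1] / benford[9]
print(ratio_1_to_9)
```

The nine Benford probabilities sum to 1 because the logarithms telescope to log 10.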


Comment A key to why Benford’s law is true is the differences in proportional 
changes associated with each leading digit. To go from one thousand to two thou- 
sand, for example, represents a 100% increase; to go from eight thousand to nine 
thousand, on the other hand, is only a 12.5% increase. That would suggest that 
evolutionary phenomena such as stock prices would be more likely to start with 
1’s and 2’s than with 8’s and 9’s—and they are. Still, the precise conditions under 
which $p(s) = \log\left(1 + \frac{1}{s}\right)$, $s = 1, 2, \ldots, 9$, holds are not fully understood and remain a topic 
of research. 


Is 

\[ p(s) = \frac{1}{1 + \lambda} \left( \frac{\lambda}{1 + \lambda} \right)^s, \quad s = 0, 1, 2, \ldots; \ \lambda > 0 \]

a discrete probability function? Why or why not? 

To qualify as a discrete probability function, a given p(s) needs to satisfy parts 
(a) and (b) of Definition 3.3.1. A simple inspection shows that part (a) is satisfied. 
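Since $\lambda > 0$, p(s) is, in fact, greater than or equal to 0 for all s = 0, 1, 2, .... Part (b) 
is satisfied if the sum of all the probabilities defined on the outcomes in S is 1. But 


\[
\begin{aligned}
\sum_{\text{all } s \in S} p(s) &= \sum_{s=0}^{\infty} \frac{1}{1 + \lambda} \left( \frac{\lambda}{1 + \lambda} \right)^s \\
&= \frac{1}{1 + \lambda} \left( \frac{1}{1 - \frac{\lambda}{1 + \lambda}} \right) && \text{(why?)} \\
&= \frac{1}{1 + \lambda} \cdot \frac{1 + \lambda}{1} \\
&= 1
\end{aligned}
\]


The answer, then, is “yes”: $p(s) = \frac{1}{1 + \lambda} \left( \frac{\lambda}{1 + \lambda} \right)^s$, $s = 0, 1, 2, \ldots$; $\lambda > 0$ does qualify as 
a discrete probability function. Of course, whether it has any practical value depends 
on whether the set of values for p(s) actually describes the behavior of real-world 
phenomena. 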
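
As a quick numerical check of part (b), the partial sums of p(s) can be computed for a few values of λ. The particular values of λ and the cutoff of 2000 terms are arbitrary choices for this sketch.

```python
def p(s, lam):
    """p(s) = (1 / (1 + lam)) * (lam / (1 + lam))**s for s = 0, 1, 2, ..."""
    return (1 / (1 + lam)) * (lam / (1 + lam)) ** s

def partial_sum(lam, n_terms):
    """Sum of p(s) over s = 0, 1, ..., n_terms - 1."""
    return sum(p(s, lam) for s in range(n_terms))

for lam in (0.5, 2.0, 10.0):
    print(lam, partial_sum(lam, 2000))
```

Each partial sum equals 1 minus a geometric remainder, so it approaches 1 more slowly when λ is large.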


Defining “New” Sample Spaces 


We have seen how the function p(s) associates a probability with each outcome, s, 
in a sample space. Related is the key idea that outcomes can often be grouped or 
reconfigured in ways that may facilitate problem solving. Recall the sample space 
associated with a series of n independent trials, where each s is an ordered sequence 
of successes and failures. The most relevant information in such outcomes is often 
the number of successes that occur, not a detailed listing of which trials ended in 
success and which ended in failure. That being the case, it makes sense to define 
a “new” sample space by grouping the original outcomes according to the number 
of successes they contained. The outcome (f, f, ..., f), for example, had 
0 successes. On the other hand, there were n outcomes that yielded 1 success: 
(s, f, f, ..., f), (f, s, f, ..., f), ..., and (f, f, ..., s). As we saw earlier in 
this chapter, that particular regrouping of outcomes ultimately led to the binomial 
distribution. 

The function that replaces the outcome (s, f, f,..., f) with the numerical value 
1 is called a random variable. We conclude this section with a discussion of some of 
the concepts, terminology, and applications associated with random variables. 


Definition 3.3.2. A function whose domain is a sample space S and whose 
values form a finite or countably infinite set of real numbers is called a discrete 
random variable. We denote random variables by uppercase letters, often 
X or Y. 


Example 3.3.5 


Consider tossing two dice, an experiment for which the sample space is a set of 
ordered pairs, $S = \{(i, j) \mid i = 1, 2, \ldots, 6;\ j = 1, 2, \ldots, 6\}$. For a variety of games ranging 
from Monopoly to craps, the sum of the numbers showing is what matters on 
a given turn. That being the case, the original sample space S of thirty-six ordered 
pairs would not provide a particularly convenient backdrop for discussing the rules 
of those games. It would be better to work directly with the sums. Of course, the 
eleven possible sums (from 2 to 12) are simply the different values of the random 
variable X, where X(i, j)=i+j. 


Comment In the above example, suppose we define a random variable $X_1$ that gives 
the result on the first die and a random variable $X_2$ that gives the result on the 
second die. Then $X = X_1 + X_2$. Note how easily we could extend this idea to the 
toss of three dice, or ten dice. The ability to conveniently express complex events in 
terms of simpler ones is an advantage of the random variable concept that we will 
see playing out over and over again. 


The Probability Density Function 


We began this section discussing the function p(s), which assigns a probability to 
each outcome s in S. Now, having introduced the notion of a random variable X as 
a real-valued function defined on S—that is, X(s) = k—we need to find a mapping 
analogous to p(s) that assigns probabilities to the different values of k. 


Definition 3.3.3. Associated with every discrete random variable X is a 
probability density function (or pdf), denoted $p_X(k)$, where 


\[ p_X(k) = P(\{s \in S \mid X(s) = k\}) \]


Note that $p_X(k) = 0$ for any k not in the range of X. For notational simplicity, 
we will usually delete all references to s and S and write $p_X(k) = P(X = k)$. 


Comment We have already discussed at length two examples of the function px (k). 
Recall the binomial distribution derived in Section 3.2. If we let the random vari- 
able X denote the number of successes in n independent trials, then Theorem 3.2.1 
states that 


\[ P(X = k) = p_X(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, \ldots, n \]

A similar result was given in that same section in connection with the hypergeometric 
distribution. If a sample of size n is drawn without replacement from 
an urn containing r red chips and w white chips, and if we let the random variable 
X denote the number of red chips included in the sample, then (according to 
Theorem 3.2.2), 

\[ P(X = k) = p_X(k) = \frac{\binom{r}{k} \binom{w}{n-k}}{\binom{r+w}{n}} \]


Consider again the rolling of two dice as described in Example 3.3.5. Let i and j 
denote the faces showing on the first and second die, respectively, and define the 
random variable X to be the sum of the two faces: X (i, j) =i + j. Find px(k). 

According to Definition 3.3.3, each value of px (k) is the sum of the probabilities 
of the outcomes that get mapped by X onto the value k. For example, 


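\[
\begin{aligned}
P(X = 5) = p_X(5) &= P(\{s \in S \mid X(s) = 5\}) \\
&= P[(1, 4), (4, 1), (2, 3), (3, 2)] \\
&= P(1, 4) + P(4, 1) + P(2, 3) + P(3, 2) \\
&= \frac{1}{36} + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} \\
&= \frac{4}{36}
\end{aligned}
\]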

assuming the dice are fair. Values of px(k) for other k are calculated similarly. 
Table 3.3.3 shows the random variable’s entire pdf. 


Table 3.3.3 

k     p_X(k)     k     p_X(k) 
2     1/36       8     5/36 
3     2/36       9     4/36 
4     3/36       10    3/36 
5     4/36       11    2/36 
6     5/36       12    1/36 
7     6/36 

Acme Industries typically produces three electric power generators a day; some 
pass the company’s quality-control inspection on their first try and are ready to be 
shipped; others need to be retooled. The probability of a generator needing further 
work is 0.05. If a generator is ready to ship, the firm earns a profit of $10,000. If 
it needs to be retooled, it ultimately costs the firm $2,000. Let X be the random 
variable quantifying the company’s daily profit. Find p_X(k). 

The underlying sample space here is a set of n = 3 independent trials, where 
p = P(Generator passes inspection) = 0.95. If the random variable X is to measure 
the company’s daily profit, then 


X = $10,000 x (no. of generators passing inspection) 


— $2,000 x (no. of generators needing retooling) 


For instance, X(s, f, s) = 2($10,000) — 1($2,000) = $18,000. Moreover, the random 
variable X equals $18,000 whenever the day’s output consists of two successes and 
one failure. That is, X(s, f, s) = X(s, s, f) = X(f, s, s). It follows that 


\[ P(X = \$18{,}000) = p_X(18{,}000) = \binom{3}{2} (0.95)^2 (0.05)^1 = 0.135375 \]
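
Repeating the same binomial calculation for every possible number of passing generators gives the complete pdf of the daily profit. The sketch below uses the figures stated in the example ($10,000 profit, $2,000 cost, pass probability 0.95); the variable names are hypothetical.

```python
from math import comb

n, p_pass = 3, 0.95
profit_pass, cost_retool = 10_000, 2_000

# Map each possible profit k to its probability p_X(k).
pdf = {}
for passed in range(n + 1):
    profit = profit_pass * passed - cost_retool * (n - passed)
    pdf[profit] = comb(n, passed) * p_pass**passed * (1 - p_pass) ** (n - passed)

for profit in sorted(pdf, reverse=True):
    print(profit, round(pdf[profit], 6))
```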


Table 3.3.4 shows p_X(k) for the four possible values of k ($30,000, $18,000, $6,000, 
and −$6,000). 


Table 3.3.4 

No. Defectives    k = Profit    p_X(k) 
0                 $30,000       0.857375 
1                 $18,000       0.135375 
2                 $6,000        0.007125 
3                 −$6,000       0.000125 


Example 3.3.8 


As part of her warm-up drill, each player on State’s basketball team is required to 
shoot free throws until two baskets are made. If Rhonda has a 65% success rate at 
the foul line, what is the pdf of the random variable X that describes the number of 
throws it takes her to complete the drill? Assume that individual throws constitute 
independent events. 

Figure 3.3.1 illustrates what must occur if the drill is to end on the kth toss, 
k=2,3,4,...: First, Rhonda needs to make exactly one basket sometime during the 
first k — 1 attempts, and, second, she needs to make a basket on the kth toss. Written 
formally, 


\[
\begin{aligned}
p_X(k) = P(X = k) &= P(\text{Drill ends on } k\text{th throw}) \\
&= P[(\text{1 basket and } k - 2 \text{ misses in first } k - 1 \text{ throws}) \cap (\text{basket on } k\text{th throw})] \\
&= P(\text{1 basket and } k - 2 \text{ misses}) \cdot P(\text{basket})
\end{aligned}
\]


Figure 3.3.1: attempts 1 through k − 1 contain exactly one basket (and k − 2 misses); attempt k is a basket. 


Notice that k − 1 different sequences have the property that exactly one of the 
first k − 1 throws results in a basket: 


Attempt:   1   2   3   4   ...   k−1 
           B   M   M   M   ...   M 
           M   B   M   M   ...   M 
           ... 
           M   M   M   M   ...   B 
(k − 1 sequences) 


Since each sequence has probability $(0.35)^{k-2}(0.65)$, 

\[ P(\text{1 basket and } k - 2 \text{ misses}) = (k - 1)(0.35)^{k-2}(0.65) \]

Therefore, 

\[
\begin{aligned}
p_X(k) &= (k - 1)(0.35)^{k-2}(0.65) \cdot (0.65) \\
&= (k - 1)(0.35)^{k-2}(0.65)^2, \quad k = 2, 3, 4, \ldots
\end{aligned}
\tag{3.3.4}
\]


Table 3.3.5 shows the pdf evaluated for specific values of k. Although the range of k 
is infinite, the bulk of the probability associated with X is concentrated in the values 
2 through 7: It is highly unlikely, for example, that Rhonda would need more than 
seven shots to complete the drill. 


Table 3.3.5 

k      p_X(k) 
2      0.4225 
3      0.2958 
4      0.1553 
5      0.0725 
6      0.0317 
7      0.0133 
8+     0.0089 

The Cumulative Distribution Function 


In working with random variables, we frequently need to calculate the probability 
that the value of a random variable is somewhere between two numbers. For example, 
suppose we have an integer-valued random variable. We might want to calculate 
an expression like $P(s \le X \le t)$. If we know the pdf for X, then 


\[ P(s \le X \le t) = \sum_{k=s}^{t} p_X(k) \]

But depending on the nature of $p_X(k)$ and the number of terms that need to be 
added, calculating the sum of $p_X(k)$ from k = s to k = t may be quite difficult. An 
alternate strategy is to use the fact that 


\[ P(s \le X \le t) = P(X \le t) - P(X \le s - 1) \]


where the two probabilities on the right represent cumulative probabilities of the 
random variable X. If the latter were available (and they often are), then evaluating 
$P(s \le X \le t)$ by one simple subtraction would clearly be easier than doing all the 
calculations implicit in $\sum_{k=s}^{t} p_X(k)$. 


Definition 3.3.4. Let X be a discrete random variable. For any real number t, 
the probability that X takes on a value $\le t$ is the cumulative distribution function 
(cdf) of X [written $F_X(t)$]. In formal notation, $F_X(t) = P(\{s \in S \mid X(s) \le t\})$. As 
was the case with pdfs, references to s and S are typically deleted, and the cdf is 
written $F_X(t) = P(X \le t)$. 


Suppose we wish to compute $P(21 \le X \le 40)$ for a binomial random variable X 
with n = 50 and p = 0.6. From Theorem 3.2.1, we know the formula for $p_X(k)$, so 
$P(21 \le X \le 40)$ can be written as a simple, although computationally cumbersome, 
sum: 

\[ P(21 \le X \le 40) = \sum_{k=21}^{40} \binom{50}{k} (0.6)^k (0.4)^{50-k} \]

Equivalently, the probability we are looking for can be expressed as the difference 
between two cdfs: 


\[ P(21 \le X \le 40) = P(X \le 40) - P(X \le 20) = F_X(40) - F_X(20) \]

As it turns out, values of the cdf for a binomial random variable are widely available, 
both in books and in computer software. Here, for example, $F_X(40) = 0.9992$ and 
$F_X(20) = 0.0034$, so 


\[ P(21 \le X \le 40) = 0.9992 - 0.0034 = 0.9958 \]
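
Both routes to this probability, the direct sum and the difference of two cdf values, are easy to carry out in software. The sketch below uses only the standard library; the helper names `binom_pdf` and `binom_cdf` are hypothetical.

```python
from math import comb

def binom_pdf(k, n, p):
    """p_X(k) for a binomial random variable (Theorem 3.2.1)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(t, n, p):
    """F_X(t) = P(X <= t) for an integer t."""
    return sum(binom_pdf(k, n, p) for k in range(0, t + 1))

n, p = 50, 0.6
direct = sum(binom_pdf(k, n, p) for k in range(21, 41))     # sum over k = 21, ..., 40
via_cdf = binom_cdf(40, n, p) - binom_cdf(20, n, p)         # F_X(40) - F_X(20)

print(round(direct, 4), round(via_cdf, 4))
```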
Example 3.3.10 


Suppose that two fair dice are rolled. Let the random variable X denote the larger 
of the two faces showing: (a) Find $F_X(t)$ for t = 1, 2, ..., 6 and (b) find $F_X(2.5)$. 


a. The sample space associated with the experiment of rolling two fair dice is the 
set of ordered pairs s = (i, j), where the face showing on the first die is i and the 
face showing on the second die is j. By assumption, all thirty-six possible outcomes 
are equally likely. Now, suppose t is some integer from 1 to 6, inclusive. 
Then 


\[
\begin{aligned}
F_X(t) &= P(X \le t) \\
&= P[\max(i, j) \le t] \\
&= P(i \le t \text{ and } j \le t) && \text{(why?)} \\
&= P(i \le t) \cdot P(j \le t) && \text{(why?)} \\
&= \frac{t}{6} \cdot \frac{t}{6} \\
&= \frac{t^2}{36}, \quad t = 1, 2, 3, 4, 5, 6
\end{aligned}
\]


b. Even though the random variable X has nonzero probability only for the integers 
1 through 6, the cdf is defined for any real number from $-\infty$ to $+\infty$. By 
definition, $F_X(2.5) = P(X \le 2.5)$. But 


\[ P(X \le 2.5) = P(X \le 2) + P(2 < X \le 2.5) = F_X(2) + 0 \]

so 

\[ F_X(2.5) = F_X(2) = \frac{2^2}{36} = \frac{1}{9} \]

What would the graph of $F_X(t)$ as a function of t look like? 
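
One way to explore the cdf of the larger face is to compute it by direct enumeration for arbitrary real t. In the sketch below, `cdf_max` counts outcomes, while `cdf_formula` encodes the observation that the cdf only changes at the integers 1 through 6; both function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import floor

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs

def cdf_max(t):
    """F_X(t) = P(max(i, j) <= t), by enumeration; t may be any real number."""
    favorable = sum(1 for i, j in outcomes if max(i, j) <= t)
    return Fraction(favorable, len(outcomes))

def cdf_formula(t):
    """Step-function form: floor(t)**2 / 36, with floor(t) clipped to 0..6."""
    m = min(max(floor(t), 0), 6)
    return Fraction(m * m, 36)

for t in (0.5, 1, 2, 2.5, 3, 6, 7.2):
    print(t, cdf_max(t), cdf_formula(t))
```

Evaluating either function on a fine grid of t values shows a function that is flat between integers and jumps at t = 1, 2, ..., 6.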


Questions 


3.3.1. An urn contains five balls numbered 1 to 5. Two 
balls are drawn simultaneously. 


(a) Let X be the larger of the two numbers drawn. Find 
px(k). 

(b) Let V be the sum of the two numbers drawn. Find 
$p_V(k)$. 


3.3.2. Repeat Question 3.3.1 for the case where the two 
balls are drawn with replacement. 


3.3.3. Suppose a fair die is tossed three times. Let X be 
the largest of the three faces that appear. Find $p_X(k)$. 


3.3.4. Suppose a fair die is tossed three times. Let X be the 
number of different faces that appear (so X = 1, 2, or 3). 


3.3.5. A fair coin is tossed three times. Let X be the number 
of heads in the tosses minus the number of tails. Find 
$p_X(k)$. 


3.3.6. Suppose die one has spots 1, 2, 2, 3, 3, 4 and die two 
has spots 1, 3, 4,5, 6, 8. If both dice are rolled, what is the 
sample space? Let X = total spots showing. Show that the 
pdf for X is the same as for normal dice. 


3.3.7. Suppose a particle moves along the x-axis begin- 
ning at 0. It moves one integer step to the left or right with 
equal probability. What is the pdf of its position after four 
steps? 


3.3.8. How would the pdf asked for in Question 3.3.7 be 
affected if the particle was twice as likely to move to the 
right as to the left? 


3.3.9. Suppose that five people, including you and a 
friend, line up at random. Let the random variable X 
denote the number of people standing between you and 
your friend. What is px (k)? 


3.3.10. Urn I and urn II each have two red chips and 
two white chips. Two chips are drawn simultaneously from 
each urn. Let $X_1$ be the number of red chips in the first 
sample and $X_2$ the number of red chips in the second 
sample. Find the pdf of $X_1 + X_2$. 


3.3.11. Suppose X is a binomial random variable with 
n=4and p= <. What is the pdf of 2X + 1? 


3.3.12. Find the cdf for the random variable X in Ques- 
tion 3.3.3. 


3.3.13. A fair die is rolled four times. Let the random vari- 
able X denote the number of 6’s that appear. Find and 
graph the cdf for X. 


3.3.14. At the points x = 0, 1, ..., 6, the cdf for the discrete 
random variable X has the value $F_X(x) = x(x + 1)/42$. 
Find the pdf for X. 


3.3.15. Find the pdf for the discrete random variable X 
whose cdf at the points x = 0, 1, ..., 6 is given by 
$F_X(x) = x^3/216$. 


3.4 Continuous Random Variables 


The statement was made in Chapter 2 that all sample spaces belong to one of two 
generic types—discrete sample spaces are ones that contain a finite or a countably 
infinite number of outcomes and continuous sample spaces are those that contain an 
uncountably infinite number of outcomes. Rolling a pair of dice and recording the 
faces that appear is an experiment with a discrete sample space; choosing a number 
at random from the interval [0, 1] would have a continuous sample space. 

How we assign probabilities to these two types of sample spaces is different. 
Section 3.3 focused on discrete sample spaces. Each outcome s is assigned a prob- 
ability by the discrete probability function p(s). If a random variable X is defined 
on the sample space, the probabilities associated with its outcomes are assigned by 
the probability density function px(k). Applying those same definitions, though, to 
the outcomes in a continuous sample space will not work. The fact that a continuous 
sample space has an uncountably infinite number of outcomes eliminates the option 
of assigning a probability to each point as we did in the discrete case with the func- 
tion p(s). We begin this section with a particular pdf defined on a discrete sample 
space that suggests how we might define probabilities, in general, on a continuous 
sample space. 

Suppose an electronic surveillance monitor is turned on briefly at the beginning 
of every hour and has a 0.905 probability of working properly, regardless of how 
long it has remained in service. If we let the random variable X denote the hour at 
which the monitor first fails, then $p_X(k)$ is the product of k individual probabilities: 


\[
\begin{aligned}
p_X(k) = P(X = k) &= P(\text{Monitor fails for the first time at the } k\text{th hour}) \\
&= P(\text{Monitor functions properly for first } k - 1 \text{ hours} \cap \text{Monitor fails at the } k\text{th hour}) \\
&= (0.905)^{k-1}(0.095), \quad k = 1, 2, 3, \ldots
\end{aligned}
\]


Figure 3.4.1 shows a probability histogram of px(k) for k values ranging from 1 to 
21. Here the height of the kth bar is $p_X(k)$, and since the width of each bar is 1, the 
area of the kth bar is also px(k). 

Now, look at Figure 3.4.2, where the exponential curve $y = 0.1e^{-0.1x}$ is superimposed 
on the graph of $p_X(k)$. Notice how closely the area under the curve 
approximates the area of the bars. It follows that the probability that X lies in some 
given interval will be numerically similar to the integral of the exponential curve 
above that same interval. 


Figure 3.4.1: probability histogram of $p_X(k)$ (vertical axis, 0 to 0.1) against the hour when the monitor first fails, k = 1, 2, ..., 21. 


Figure 3.4.2: the same histogram with the curve $y = 0.1e^{-0.1x}$ superimposed. 


For example, the probability that the monitor fails sometime during the first 
four hours would be the sum 


\[
\begin{aligned}
P(0 \le X \le 4) &= \sum_{k=1}^{4} p_X(k) \\
&= \sum_{k=1}^{4} (0.905)^{k-1}(0.095) \\
&= 1 - (0.905)^4 \\
&= 0.3292
\end{aligned}
\]


The corresponding area under the exponential curve is nearly the same: 


\[ \int_0^4 0.1e^{-0.1x} \, dx = 1 - e^{-0.4} = 0.3297 \]
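
The discrete sum and the area under the exponential curve can be computed side by side. The sketch below evaluates the geometric-type sum for k = 1, ..., 4 and the closed form of the integral of $0.1e^{-0.1x}$ over [0, 4]; the variable names are hypothetical.

```python
from math import exp

def p_x(k):
    """P(monitor first fails at hour k) = 0.905**(k - 1) * 0.095, k = 1, 2, ..."""
    return 0.905 ** (k - 1) * 0.095

discrete = sum(p_x(k) for k in range(1, 5))   # P(X <= 4)
area = 1 - exp(-0.1 * 4)                      # integral of 0.1 * e^(-0.1 x) from 0 to 4

print(round(discrete, 4), round(area, 4))
```

The two numbers are close because 0.905 is approximately $e^{-0.1}$, so the bar heights track the curve.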


Implicit in the similarity here between $p_X(k)$ and the exponential curve 
$y = 0.1e^{-0.1x}$ is our sought-after alternative to p(s) for continuous sample spaces. Instead 
of defining probabilities for individual points, we will define probabilities for intervals 
of points, and those probabilities will be areas under the graph of some function 
(such as $y = 0.1e^{-0.1x}$), where the shape of the function will reflect the desired 
probability “measure” to be associated with the sample space. 


Definition 3.4.1. A probability function P on a set of real numbers S is called 
continuous if there exists a function f(t) such that for any closed interval 
$[a, b] \subset S$, $P([a, b]) = \int_a^b f(t) \, dt$. 


Comment If a probability function P satisfies Definition 3.4.1, then $P(A) = 
\int_A f(t) \, dt$ for any set A where the integral is defined. 
Conversely, suppose a function f(t) has the two properties 


1. $f(t) \ge 0$ for all t. 
2. $\int_{-\infty}^{\infty} f(t) \, dt = 1$. 


If $P(A) = \int_A f(t) \, dt$ for all A, then P will satisfy the probability axioms given in 
Section 2.3. 


Choosing the Function f(t) 


We have seen that the probability structure of any sample space with a finite 
or countably infinite number of outcomes is defined by the function p(s) = 
P(Outcome is s). For sample spaces having an uncountably infinite number of pos- 
sible outcomes, the function f(t) serves an analogous purpose. Specifically, f(t) 
defines the probability structure of S in the sense that the probability of any interval 
in the sample space is the integral of f(t). The next set of examples illustrate several 
different choices for f(t). 


Example 3.4.1

The continuous equivalent of the equiprobable probability model on a discrete sample space is the function \(f(t)\) defined by \(f(t) = 1/(b - a)\) for all \(t\) in the interval \([a, b]\) (and \(f(t) = 0\), otherwise). This particular \(f(t)\) places equal probability weighting on every closed interval of the same length contained in the interval \([a, b]\). For example, suppose \(a = 0\) and \(b = 10\), and let \(A = [1, 3]\) and \(B = [6, 8]\). Then \(f(t) = \frac{1}{10}\), and

\[ P(A) = \int_1^3 \frac{1}{10}\,dt = \frac{2}{10} = P(B) = \int_6^8 \frac{1}{10}\,dt \]

(See Figure 3.4.3.)

[Figure 3.4.3: graph of the probability density \(f(t) = \frac{1}{10}\) on \([0, 10]\), showing \(P(A)\) and \(P(B)\) as areas of equal size]
Example 3.4.2

Could \(f(t) = 3t^2\), \(0 \le t \le 1\), be used to define the probability function for a continuous sample space whose outcomes consist of all the real numbers in the interval [0, 1]? Yes, because (1) \(f(t) \ge 0\) for all \(t\), and (2) \(\int_0^1 f(t)\,dt = \int_0^1 3t^2\,dt = t^3\big|_0^1 = 1\).

Notice that the shape of \(f(t)\) (see Figure 3.4.4) implies that outcomes close to 1 are more likely to occur than are outcomes close to 0. For example,

\[ P\left(\left[0, \tfrac{1}{3}\right]\right) = \int_0^{1/3} 3t^2\,dt = t^3\Big|_0^{1/3} = \frac{1}{27} \]

while

\[ P\left(\left[\tfrac{2}{3}, 1\right]\right) = \int_{2/3}^{1} 3t^2\,dt = t^3\Big|_{2/3}^{1} = 1 - \frac{8}{27} = \frac{19}{27} \]

[Figure 3.4.4: graph of the probability density \(f(t) = 3t^2\) on \(0 \le t \le 1\)]

Example 3.4.3

By far the most important of all continuous probability functions is the “bell-shaped” curve, known more formally as the normal (or Gaussian) distribution. The sample space for the normal distribution is the entire real line; its probability function is given by

\[ f(t) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{t - \mu}{\sigma}\right)^2\right], \quad -\infty < t < \infty, \ -\infty < \mu < \infty, \ \sigma > 0 \]

Depending on the values assigned to the parameters \(\mu\) and \(\sigma\), \(f(t)\) can take on a variety of shapes and locations; three are illustrated in Figure 3.4.5.

[Figure 3.4.5: three normal curves with different parameter values; one is labeled \(\mu = -4\), \(\sigma = 0.5\)]
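As a rough numerical check on the two properties required of f(t), the following sketch integrates the normal probability function with a simple midpoint rule. The parameter values mu = -4 and sigma = 0.5 are taken from the one legible label in Figure 3.4.5; the function names and the integration limits (ten standard deviations on either side of mu) are illustrative assumptions, not part of the text.

```python
import math


def normal_pdf(t, mu, sigma):
    """Normal probability function f(t) with parameters mu and sigma > 0."""
    z = (t - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)


def area(mu, sigma, a, b, n=100_000):
    """Midpoint-rule approximation of the area under f(t) over [a, b]."""
    h = (b - a) / n
    return h * sum(normal_pdf(a + (i + 0.5) * h, mu, sigma) for i in range(n))


mu, sigma = -4.0, 0.5
# Area over mu +/- 10 sigma stands in for the whole real line.
total = area(mu, sigma, mu - 10 * sigma, mu + 10 * sigma)
# Probability of the interval [mu - sigma, mu + sigma].
within_one_sigma = area(mu, sigma, mu - sigma, mu + sigma)

print(f"total area ~ {total:.6f}")
print(f"P([mu - sigma, mu + sigma]) ~ {within_one_sigma:.4f}")
```

The total comes out numerically indistinguishable from 1, and changing mu or sigma moves or rescales the curve without changing that total.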


Fitting f(t) to Data: The Density-Scaled Histogram 


The notion of using a continuous probability function to approximate an integer- 
valued discrete probability model has already been discussed (recall Figure 3.4.2). 
The “trick” there was to replace the spikes that define px (k) with rectangles whose 
heights are px (k) and whose widths are 1. Doing that makes the sum of the areas of 
the rectangles corresponding to px (k) equal to 1, which is the same as the total area 
under the approximating continuous probability function. Because of the equality 
of those two areas, it makes sense to superimpose (and compare) the “histogram” 
of px(k) and the continuous probability function on the same set of axes. 

Now, consider the related, but slightly more general, problem of using a continuous probability function to model the distribution of a set of \(n\) measurements, \(y_1, y_2, \ldots, y_n\). Following the approach taken in Figure 3.4.2, we would start by making a histogram of the \(n\) observations. The problem, though, is that the sum of the areas of the bars comprising that histogram would not necessarily equal 1.

As a case in point, Table 3.4.1 shows a set of forty observations. Grouping those \(y_i\)’s into five classes, each of width 10, produces the distribution and histogram pictured in Figure 3.4.6. Furthermore, suppose we have reason to believe that these forty \(y_i\)’s may be a random sample from a uniform probability function defined over the interval [20, 70], that is,

\[ f(t) = \frac{1}{70 - 20} = \frac{1}{50}, \quad 20 \le t \le 70 \]

(recall Example 3.4.1). How can we appropriately draw the distribution of the \(y_i\)’s and the uniform probability model on the same graph?

Table 3.4.1

33.8  62.6  42.3  62.9  32.9  58.9  60.8  49.1  42.6  59.8
41.6  54.5  40.5  30.3  22.4  25.0  59.2  67.5  64.1  59.3
24.9  22.3  69.7  41.2  64.5  33.4  39.0  53.1  21.6  46.0
28.1  68.7  27.6  57.6  54.8  48.9  68.4  38.4  69.0  46.6

Figure 3.4.6

Class            Frequency
20 ≤ y < 30          7
30 ≤ y < 40          6
40 ≤ y < 50          9
50 ≤ y < 60          8
60 ≤ y < 70         10
Total               40

[Histogram: frequency (vertical axis, 0 to 12) versus y (horizontal axis, 20 to 70)]


Note, first, that \(f(t)\) and the histogram are not compatible in the sense that the
area under \(f(t)\) is (necessarily) 1 \((= 50 \times \frac{1}{50})\), but the sum of the areas of the bars
making up the histogram is 400:


histogram area = 10(7) + 10(6) + 10(9) + 10(8) + 10(10) 
= 400 


Nevertheless, we can “force” the total area of the five bars to match the area under 
f(t) by redefining the scale of the vertical axis on the histogram. Specifically, fre- 
quency needs to be replaced with the analog of probability density, which would 
be the scale used on the vertical axis of any graph of f(t). Intuitively, the density 
associated with, say, the interval [20, 30) would be defined as the quotient 


\[ \frac{7}{40 \times 10} \]

because integrating that constant over the interval [20, 30) would give \(\frac{7}{40}\), and the
latter does represent the estimated probability that an observation belongs to the
interval [20, 30).
Figure 3.4.7 shows a histogram of the data in Table 3.4.1, where the height of 
each bar has been converted to a density, according to the formula 


\[ \text{density (of a class)} = \frac{\text{class frequency}}{\text{total no. of observations} \times \text{class width}} \]

Superimposed is the uniform probability model, \(f(t) = \frac{1}{50}\), \(20 \le t \le 70\). Scaled in this fashion, areas under both \(f(t)\) and the histogram are 1.


Figure 3.4.7

Class            Density
20 ≤ y < 30       7/[40(10)] = 0.0175
30 ≤ y < 40       6/[40(10)] = 0.0150
40 ≤ y < 50       9/[40(10)] = 0.0225
50 ≤ y < 60       8/[40(10)] = 0.0200
60 ≤ y < 70      10/[40(10)] = 0.0250

[Histogram: density (vertical axis, 0 to 0.02 and above) versus y (horizontal axis, 20 to 70), with the uniform probability function superimposed]


In practice, density-scaled histograms offer a simple, but effective, format for 
examining the “fit” between a set of data and a presumed continuous model. 
We will use it often in the chapters ahead. Applied statisticians have especially 
embraced this particular graphical technique. Indeed, computer software packages 
that include Histograms on their menus routinely give users the choice of putting 
either frequency or density on the vertical axis. 
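The density conversion described above can be sketched in a few lines. The data are the forty observations of Table 3.4.1 and the class boundaries are those of Figure 3.4.6; the helper name `density_histogram` and its signature are illustrative assumptions, not something the text prescribes.

```python
# Observations from Table 3.4.1 (forty values).
data = [
    33.8, 62.6, 42.3, 62.9, 32.9, 58.9, 60.8, 49.1, 42.6, 59.8,
    41.6, 54.5, 40.5, 30.3, 22.4, 25.0, 59.2, 67.5, 64.1, 59.3,
    24.9, 22.3, 69.7, 41.2, 64.5, 33.4, 39.0, 53.1, 21.6, 46.0,
    28.1, 68.7, 27.6, 57.6, 54.8, 48.9, 68.4, 38.4, 69.0, 46.6,
]


def density_histogram(values, start, width, n_classes):
    """Return (frequencies, densities) for equal-width classes [lo, lo + width)."""
    freqs = [0] * n_classes
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_classes:
            freqs[idx] += 1
    n = len(values)
    # density = class frequency / (total no. of observations * class width)
    dens = [f / (n * width) for f in freqs]
    return freqs, dens


WIDTH = 10
freqs, dens = density_histogram(data, start=20, width=WIDTH, n_classes=5)
total_area = sum(d * WIDTH for d in dens)
uniform_density = 1 / (70 - 20)  # f(t) = 1/50 on [20, 70]

for i, (f, d) in enumerate(zip(freqs, dens)):
    lo = 20 + i * WIDTH
    print(f"[{lo}, {lo + WIDTH}): frequency = {f:2d}, density = {d:.4f}")
print(f"total area = {total_area:.4f}; uniform model height = {uniform_density:.4f}")
```

With the vertical axis rescaled this way, each bar height can be compared directly with the model height 1/50 = 0.02.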


Case Study 3.4.1 


Years ago, the V805 transmitter tube was standard equipment on many aircraft radar systems. Table 3.4.2 summarizes part of a reliability study done on the V805; listed are the lifetimes (in hrs) recorded for 903 tubes (35). Grouped into intervals of width 80, the densities for the nine classes are shown in the last column.

Table 3.4.2

Lifetime (hrs)    Number of Tubes    Density
0-80                   317           0.0044
80-160                 230           0.0032
160-240                118           0.0016
240-320                 93           0.0013
320-400                 49           0.0007
400-480                 33           0.0005
480-560                 17           0.0002
560-700                 26           0.0002
700+                    20           0.0002
Total                  903


Experience has shown that lifetimes of electrical equipment can often be 
nicely modeled by the exponential probability function, 


\[ f(t) = \lambda e^{-\lambda t}, \quad t > 0 \]

where the value of \(\lambda\) (for reasons explained in Chapter 5) is set equal to
the reciprocal of the average lifetime of the tubes in the sample. Can the
distribution of these data also be described by the exponential model?


One way to answer such a question is to superimpose the proposed model
on a graph of the density-scaled histogram. The extent to which the two graphs 
are similar then becomes an obvious measure of the appropriateness of the 
model. 


[Figure 3.4.8: density-scaled histogram of V805 lifetimes (hrs) with \(f(t) = 0.0056e^{-0.0056t}\) superimposed; vertical axis, probability density (0 to 0.004); shaded area = P(lifetime > 500)]

For these data, \(\lambda\) would be 0.0056. Figure 3.4.8 shows the function

\[ f(t) = 0.0056e^{-0.0056t} \]


plotted on the same axes as the density-scaled histogram. Clearly, the agreement 
is excellent, and we would have no reservations about using areas under f(t) to 
estimate lifetime probabilities. How likely is it, for example, that a V805 tube 
will last longer than five hundred hrs? Based on the exponential model, that 
probability would be 0.0608: 


\[ P(\text{V805 lifetime exceeds 500 hrs}) = \int_{500}^{\infty} 0.0056e^{-0.0056y}\,dy = -e^{-0.0056y}\Big|_{500}^{\infty} = e^{-0.0056(500)} = e^{-2.8} = 0.0608 \]
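The comparison made in Figure 3.4.8 can also be done numerically. The sketch below uses the class counts of Table 3.4.2 and the rate 0.0056 quoted in the case study; the open-ended "700+" class is left out because it has no width, and the function names are illustrative assumptions.

```python
import math

LAMBDA = 0.0056  # rate quoted in the case study
N_TUBES = 903

# (lower limit, upper limit, number of tubes) for the bounded classes of Table 3.4.2
classes = [
    (0, 80, 317), (80, 160, 230), (160, 240, 118), (240, 320, 93),
    (320, 400, 49), (400, 480, 33), (480, 560, 17), (560, 700, 26),
]


def exp_interval_prob(a, b, lam=LAMBDA):
    """P(a <= Y <= b): area under lam * exp(-lam * y) between a and b."""
    return math.exp(-lam * a) - math.exp(-lam * b)


def exp_tail_prob(a, lam=LAMBDA):
    """P(Y > a) for the exponential model."""
    return math.exp(-lam * a)


for lo, hi, count in classes:
    observed = count / (N_TUBES * (hi - lo))       # density-scaled bar height
    model = exp_interval_prob(lo, hi) / (hi - lo)  # average model height on the class
    print(f"{lo:3d}-{hi:3d}: observed = {observed:.4f}, model = {model:.4f}")

tail_500 = exp_tail_prob(500)
print(f"P(lifetime > 500) ~ {tail_500:.4f}")
```

Comparing the average model height over each class with the observed bar height is one simple way to quantify the "fit" that the figure shows visually.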


Continuous Probability Density Functions 


We saw in Section 3.3 how the introduction of discrete random variables facilitates 
the solution of certain problems. The same sort of function can also be defined on 
sample spaces with an uncountably infinite number of outcomes. Usually, the sample 
space is an interval of real numbers —finite or infinite. The notation and techniques 
for this type of random variable replace sums with integrals. 


Definition 3.4.2. Let \(Y\) be a function from a sample space \(S\) to the real numbers. The function \(Y\) is called a continuous random variable if there exists a function \(f_Y(y)\) such that for any real numbers \(a\) and \(b\) with \(a < b\)

\[ P(a \le Y \le b) = \int_a^b f_Y(y)\,dy \]

The function \(f_Y(y)\) is the probability density function (pdf) for \(Y\).
As in the discrete case, the cumulative distribution function (cdf) is defined by

\[ F_Y(y) = P(Y \le y) \]

The cdf in the continuous case is just an integral of \(f_Y(y)\), that is,

\[ F_Y(y) = \int_{-\infty}^{y} f_Y(t)\,dt \]

Let \(f(y)\) be an arbitrary real-valued function defined on some subset \(S\) of the real numbers. If

1. \(f(y) \ge 0\) for all \(y\) in \(S\) and
2. \(\int_S f(y)\,dy = 1\)

then \(f(y) = f_Y(y)\) for all \(y\), where the random variable \(Y\) is the identity mapping.

Example 3.4.4

We saw in Case Study 3.4.1 that lifetimes of V805 radar tubes can be nicely modeled by the exponential probability function

\[ f(t) = 0.0056e^{-0.0056t}, \quad t \ge 0 \]

To couch that statement in random variable notation would simply require that we define \(Y\) to be the life of a V805 radar tube. Then \(Y\) would be the identity mapping, and the pdf for the random variable \(Y\) would be the same as the probability function, \(f(t)\). That is, we would write

\[ f_Y(y) = 0.0056e^{-0.0056y}, \quad y \ge 0 \]

Similarly, when we work with the bell-shaped normal distribution in later chapters, we will write the model in random variable notation as

\[ f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right], \quad -\infty < y < \infty \]

Example 3.4.5

Suppose we would like a continuous random variable \(Y\) to “select” a number between 0 and 1 in such a way that intervals near the middle of the range would be more likely to be represented than intervals near either 0 or 1. One pdf having that property is the function \(f_Y(y) = 6y(1 - y)\), \(0 \le y \le 1\) (see Figure 3.4.9). Do we know for certain that the function pictured in Figure 3.4.9 is a “legitimate” pdf? Yes, because \(f_Y(y) \ge 0\) for all \(y\), and

\[ \int_0^1 6y(1 - y)\,dy = 6\left[\frac{y^2}{2} - \frac{y^3}{3}\right]_0^1 = 1 \]
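The two conditions for a legitimate pdf can also be checked numerically. The sketch below applies a midpoint rule to f_Y(y) = 6y(1 - y); the helper `integrate` and the particular sub-intervals compared are illustrative choices, not part of the example itself.

```python
def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def f_Y(y):
    """pdf of Example 3.4.5; taken to be 0 outside [0, 1]."""
    return 6 * y * (1 - y) if 0 <= y <= 1 else 0.0


total = integrate(f_Y, 0, 1)            # should be 1 for a legitimate pdf
middle = integrate(f_Y, 0.25, 0.75)     # P(1/4 <= Y <= 3/4)
edges = integrate(f_Y, 0, 0.25) + integrate(f_Y, 0.75, 1)

print(f"total area          ~ {total:.6f}")
print(f"P(0.25 <= Y <= 0.75) ~ {middle:.4f}")
print(f"P(Y near 0 or 1)     ~ {edges:.4f}")
```

The middle half of the range carries more than twice the probability of the two outer quarters combined, which is the behavior the example asks for.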


Comment To simplify the way pdfs are written, it will be assumed that \(f_Y(y) = 0\) for all \(y\) outside the range actually specified in the function’s definition. In Example 3.4.5, for instance, the statement \(f_Y(y) = 6y(1 - y)\), \(0 \le y \le 1\), is to be interpreted as an abbreviation for

\[ f_Y(y) = \begin{cases} 0, & y < 0 \\ 6y(1 - y), & 0 \le y \le 1 \\ 0, & y > 1 \end{cases} \]

[Figure 3.4.9: graph of \(f_Y(y) = 6y(1 - y)\) on \(0 \le y \le 1\); vertical axis, probability density]

Continuous Cumulative Distribution Functions

Associated with every random variable, discrete or continuous, is a cumulative distribution function. For discrete random variables (recall Definition 3.3.4), the cdf is a nondecreasing step function, where the “jumps” occur at the values of \(t\) for which the pdf has positive probability. For continuous random variables, the cdf is a monotonically nondecreasing continuous function. In both cases, the cdf can be helpful in calculating the probability that a random variable takes on a value in a given interval. As we will see in later chapters, there are also several important relationships that hold for continuous cdfs and pdfs. One such relationship is cited in Theorem 3.4.1.

Definition 3.4.3. The cdf for a continuous random variable \(Y\) is an indefinite integral of its pdf:

\[ F_Y(y) = \int_{-\infty}^{y} f_Y(r)\,dr = P(\{s \in S \mid Y(s) \le y\}) = P(Y \le y) \]

Theorem 3.4.1

Let \(F_Y(y)\) be the cdf of a continuous random variable \(Y\). Then

\[ \frac{d}{dy} F_Y(y) = f_Y(y) \]

Proof The statement of Theorem 3.4.1 follows immediately from the Fundamental Theorem of Calculus.

Theorem 3.4.2

Let \(Y\) be a continuous random variable with cdf \(F_Y(y)\). Then

a. \(P(Y > s) = 1 - F_Y(s)\)
b. \(P(r < Y \le s) = F_Y(s) - F_Y(r)\)
c. \(\lim_{y \to \infty} F_Y(y) = 1\)
d. \(\lim_{y \to -\infty} F_Y(y) = 0\)

Proof

a. \(P(Y > s) = 1 - P(Y \le s)\) since \((Y > s)\) and \((Y \le s)\) are complementary events. But \(P(Y \le s) = F_Y(s)\), and the conclusion follows.

b. Since the set \((r < Y \le s) = (Y \le s) - (Y \le r)\), \(P(r < Y \le s) = P(Y \le s) - P(Y \le r) = F_Y(s) - F_Y(r)\).

c. Let \(\{y_n\}\) be a set of values of \(Y\), \(n = 1, 2, 3, \ldots\), where \(y_n < y_{n+1}\) for all \(n\), and \(\lim_{n \to \infty} y_n = \infty\). If \(\lim_{n \to \infty} F_Y(y_n) = 1\) for every such sequence \(\{y_n\}\), then \(\lim_{y \to \infty} F_Y(y) = 1\). To that end, set \(A_1 = (Y \le y_1)\), and let \(A_n = (y_{n-1} < Y \le y_n)\) for \(n = 2, 3, \ldots\). Then \(F_Y(y_n) = P\left(\bigcup_{k=1}^{n} A_k\right) = \sum_{k=1}^{n} P(A_k)\), since the \(A_k\)’s are disjoint. Also, the sample space \(S = \bigcup_{k=1}^{\infty} A_k\), and by Axiom 4, \(1 = P(S) = P\left(\bigcup_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} P(A_k)\). Putting these equalities together gives

\[ 1 = \sum_{k=1}^{\infty} P(A_k) = \lim_{n \to \infty} \sum_{k=1}^{n} P(A_k) = \lim_{n \to \infty} F_Y(y_n) \]

d. \[ \lim_{y \to -\infty} F_Y(y) = \lim_{y \to -\infty} P(Y \le y) = \lim_{y \to -\infty} P(-Y \ge -y) = \lim_{y \to -\infty} [1 - P(-Y \le -y)] \]

\[ = 1 - \lim_{y \to -\infty} P(-Y \le -y) = 1 - \lim_{y \to \infty} P(-Y \le y) = 1 - \lim_{y \to \infty} F_{-Y}(y) = 0 \]
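Theorems 3.4.1 and 3.4.2 can be illustrated with the pdf of Example 3.4.5, whose cdf on [0, 1] is 3y^2 - 2y^3 (obtained by integrating 6y(1 - y) from 0 to y). The sketch below is only a numerical illustration under that assumption; the helper names and the evaluation points are arbitrary choices.

```python
def f_Y(y):
    """pdf of Example 3.4.5, zero outside [0, 1]."""
    return 6 * y * (1 - y) if 0 <= y <= 1 else 0.0


def F_Y(y):
    """cdf of Y: integral of f_Y from minus infinity to y."""
    if y < 0:
        return 0.0
    if y > 1:
        return 1.0
    return 3 * y**2 - 2 * y**3


def derivative(F, y, h=1e-6):
    """Central-difference approximation of dF/dy."""
    return (F(y + h) - F(y - h)) / (2 * h)


y0 = 0.3
slope = derivative(F_Y, y0)          # Theorem 3.4.1: should match f_Y(y0)
upper_tail = 1 - F_Y(0.75)           # Theorem 3.4.2(a): P(Y > 0.75)
between = F_Y(0.75) - F_Y(0.25)      # Theorem 3.4.2(b): P(0.25 < Y <= 0.75)

print(f"dF/dy at {y0} ~ {slope:.6f}, f_Y({y0}) = {f_Y(y0):.6f}")
print(f"P(Y > 0.75) = {upper_tail:.5f}")
print(f"P(0.25 < Y <= 0.75) = {between:.5f}")
print(f"F_Y far right = {F_Y(10.0)}, F_Y far left = {F_Y(-10.0)}")
```

Because this Y lives on [0, 1], the limits in parts (c) and (d) are reached exactly once y leaves that interval.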


Questions 


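3.4.1. Suppose \(f_Y(y) = 4y^3\), \(0 \le y \le 1\). Find \(P(0 \le Y \le \frac{1}{2})\).


3.4.2. For the random variable \(Y\) with pdf \(f_Y(y) = \frac{2}{3} + \frac{2}{3}y\), \(0 \le y \le 1\), find \(P(\frac{3}{4} \le Y \le 1)\).

3.4.3. Let \(f_Y(y) = \frac{3}{2}y^2\), \(-1 \le y \le 1\). Find \(P(|Y - \frac{1}{2}| < \frac{1}{4})\).
Draw a graph of \(f_Y(y)\) and show the area representing the
desired probability.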


3.4.4. For persons infected with a certain form of malaria,
the length of time spent in remission is described by the
continuous pdf \(f_Y(y) = \frac{1}{9}y^2\), \(0 \le y \le 3\), where \(Y\) is measured
in years. What is the probability that a malaria patient’s
remission lasts longer than one year?


3.4.5. The length of time, \(Y\), that a customer spends in line
at a bank teller’s window before being served is described
by the exponential pdf \(f_Y(y) = 0.2e^{-0.2y}\), \(y \ge 0\).


(a) What is the probability that a customer will wait more 
than ten minutes? 

(b) Suppose the customer will leave if the wait is more 
than ten minutes. Assume that the customer goes to 
the bank twice next month. Let the random variable 
X be the number of times the customer leaves without 
being served. Calculate px (1). 


3.4.6. Let \(n\) be a positive integer. Show that \(f_Y(y) = (n + 2)(n + 1)y^n(1 - y)\), \(0 \le y \le 1\), is a pdf.


3.4.7. Find the cdf for the random variable \(Y\) given in
Question 3.4.1. Calculate \(P(0 \le Y \le \frac{1}{2})\) using \(F_Y(y)\).


3.4.8. If \(Y\) is an exponential random variable, \(f_Y(y) = \lambda e^{-\lambda y}\), \(y \ge 0\), find \(F_Y(y)\).


3.4.9. If the pdf for \(Y\) is

\[ f_Y(y) = \begin{cases} 0, & |y| > 1 \\ 1 - |y|, & |y| \le 1 \end{cases} \]

find and graph \(F_Y(y)\).


3.4.10. A continuous random variable \(Y\) has a cdf given by

\[ F_Y(y) = \begin{cases} 0, & y < 0 \\ y^2, & 0 \le y < 1 \\ 1, & y \ge 1 \end{cases} \]

Find \(P(\frac{1}{2} < Y \le \frac{3}{4})\) two ways: first, by using the cdf and second, by using the pdf.
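3.4.11. A random variable \(Y\) has cdf

\[ F_Y(y) = \begin{cases} 0, & y < 1 \\ \ln y, & 1 \le y \le e \\ 1, & e < y \end{cases} \]

Find
(a) \(P(Y < 2)\)
(b) \(P(2 < Y \le 2\frac{1}{2})\)
(c) \(P(2 < Y < 2\frac{1}{2})\)
(d) \(f_Y(y)\)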


3.4.12. The cdf for a random variable \(Y\) is defined by
\(F_Y(y) = 0\) for \(y < 0\); \(F_Y(y) = 4y^3 - 3y^4\) for \(0 \le y \le 1\); and
\(F_Y(y) = 1\) for \(y > 1\). Find \(P(\frac{1}{4} \le Y \le \frac{3}{4})\) by integrating \(f_Y(y)\).


3.4.13. Suppose \(F_Y(y) = \frac{1}{12}(y^2 + y^3)\), \(0 \le y \le 2\). Find \(f_Y(y)\).


3.4.14. In a certain country, the distribution of a family’s disposable income, \(Y\), is described by the pdf \(f_Y(y) = ye^{-y}\), \(y \ge 0\). Find \(F_Y(y)\).


3.4.15. The logistic curve \(F(y) = \frac{1}{1 + e^{-y}}\), \(-\infty < y < \infty\), can represent a cdf since it is increasing, \(\lim_{y \to -\infty} \frac{1}{1 + e^{-y}} = 0\), and \(\lim_{y \to +\infty} \frac{1}{1 + e^{-y}} = 1\). Verify these three assertions and also find the associated pdf.


3.4.16. Let \(Y\) be the random variable described in Question 3.4.1. Define \(W = 2Y\). Find \(f_W(w)\). For which values of \(w\) is \(f_W(w) \ne 0\)?
3.4.17. Suppose that \(f_Y(y)\) is a continuous and symmetric pdf, where symmetry is the property that \(f_Y(y) = f_Y(-y)\) for all \(y\). Show that \(P(-a \le Y \le a) = 2F_Y(a) - 1\).


3.4.18. Let \(Y\) be a random variable denoting the age at which a piece of equipment fails. In reliability theory, the probability that an item fails at time \(y\) given that it has survived until time \(y\) is called the hazard rate, \(h(y)\). In terms of the pdf and cdf,

\[ h(y) = \frac{f_Y(y)}{1 - F_Y(y)} \]

Find \(h(y)\) if \(Y\) has an exponential pdf (see Question 3.4.8).


3.5 Expected Values


Probability density functions, as we have already seen, provide a global overview of a random variable’s behavior. If \(X\) is discrete, \(p_X(k)\) gives \(P(X = k)\) for all \(k\); if \(Y\) is continuous, and \(A\) is any interval or a countable union of intervals, \(P(Y \in A) = \int_A f_Y(y)\,dy\). Detail that explicit, though, is not always necessary, or even helpful. There are times when a more prudent strategy is to focus the information contained in a pdf by summarizing certain of its features with single numbers.

The first such feature that we will examine is central tendency, a term referring to the “average” value of a random variable. Consider the pdfs \(p_X(k)\) and \(f_Y(y)\) pictured in Figure 3.5.1. Although we obviously cannot predict with certainty what values any future \(X\)’s and \(Y\)’s will take on, it seems clear that \(X\) values will tend to lie somewhere near \(\mu_X\), and \(Y\) values somewhere near \(\mu_Y\). In some sense, then, we can characterize \(p_X(k)\) by \(\mu_X\), and \(f_Y(y)\) by \(\mu_Y\).

[Figure 3.5.1: graphs of \(p_X(k)\) and \(f_Y(y)\), with \(\mu_X\) and \(\mu_Y\) marked on the horizontal axes]

The most frequently used measure for describing central tendency (that is, for quantifying \(\mu_X\) and \(\mu_Y\)) is the expected value. Discussed at some length in this section and in Section 3.9, the expected value of a random variable is a slightly more abstract formulation of what we are already familiar with in simple discrete settings as the arithmetic average. Here, though, the values included in the average are “weighted” by the pdf.

Gambling affords a familiar illustration of the notion of an expected value. Consider the game of roulette. After bets are placed, the croupier spins the wheel and declares one of thirty-eight numbers, 00, 0, 1, 2, ..., 36, to be the winner. Disregarding what seems to be a perverse tendency of many roulette wheels to land on numbers for which no money has been wagered, we will assume that each of these thirty-eight numbers is equally likely (although only the eighteen numbers 1, 3, 5, ..., 35 are considered to be odd and only the eighteen numbers 2, 4, 6, ..., 36 are considered to be even). Suppose that our particular bet (at “even money”) is $1 on odds. If the random variable \(X\) denotes our winnings, then \(X\) takes on the value 1 if an odd number occurs, and −1 otherwise. Therefore,

\[ p_X(1) = P(X = 1) = \frac{18}{38} = \frac{9}{19} \]

and

\[ p_X(-1) = P(X = -1) = \frac{20}{38} = \frac{10}{19} \]

Then \(\frac{9}{19}\) of the time we will win $1 and \(\frac{10}{19}\) of the time we will lose $1. Intuitively, then, if we persist in this foolishness, we stand to lose, on the average, a little more than 5¢ each time we play the game:

\[ \text{“expected” winnings} = \$1 \cdot \frac{9}{19} + (-\$1) \cdot \frac{10}{19} = -\$0.053 \approx -5\text{¢} \]

The number −0.053 is called the expected value of \(X\).

Physically, an expected value can be thought of as a center of gravity. Here, for example, imagine two bars of height \(\frac{10}{19}\) and \(\frac{9}{19}\) positioned along a weightless \(X\)-axis at the points −1 and +1, respectively (see Figure 3.5.2). If a fulcrum were placed at the point −0.053, the system would be in balance, implying that we can think of that point as marking off the center of the random variable’s distribution.

[Figure 3.5.2: bars of height \(\frac{10}{19}\) at −1 and \(\frac{9}{19}\) at +1 on the \(X\)-axis, with the balance point marked at −0.053]
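The roulette calculation is a direct weighted sum, which can be sketched with exact fractions. The dictionary representation of the pdf and the helper name `expected_value` are illustrative assumptions.

```python
from fractions import Fraction


def expected_value(pmf):
    """Sum of k * p_X(k) over the support; pmf maps each value k to p_X(k)."""
    if sum(pmf.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(k * p for k, p in pmf.items())


# $1 bet on odds: win $1 with probability 18/38, lose $1 with probability 20/38.
roulette = {1: Fraction(18, 38), -1: Fraction(20, 38)}
ev = expected_value(roulette)

print(f"E(X) = {ev} ~ {float(ev):.3f} dollars per play")
```

The same function works for any discrete pdf with finitely many values, since it implements only the weighted sum and nothing specific to roulette.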


If \(X\) is a discrete random variable taking on each of its values with the same probability, the expected value of \(X\) is simply the everyday notion of an arithmetic average or mean:

\[ \text{expected value of } X = \sum_{\text{all } k} k \cdot \frac{1}{n} = \frac{1}{n} \sum_{\text{all } k} k \]

Extending this idea to a discrete \(X\) described by an arbitrary pdf, \(p_X(k)\), gives

\[ \text{expected value of } X = \sum_{\text{all } k} k \cdot p_X(k) \tag{3.5.1} \]

For a continuous random variable \(Y\), the summation in Equation 3.5.1 is replaced by an integration and \(k \cdot p_X(k)\) becomes \(y \cdot f_Y(y)\).


Definition 3.5.1. Let \(X\) be a discrete random variable with probability function \(p_X(k)\). The expected value of \(X\) is denoted \(E(X)\) (or sometimes \(\mu\) or \(\mu_X\)) and is given by

\[ E(X) = \mu = \mu_X = \sum_{\text{all } k} k \cdot p_X(k) \]

Similarly, if \(Y\) is a continuous random variable with pdf \(f_Y(y)\),

\[ E(Y) = \mu = \mu_Y = \int_{-\infty}^{\infty} y \cdot f_Y(y)\,dy \]

Comment We assume that both the sum and the integral in Definition 3.5.1 converge absolutely:

\[ \sum_{\text{all } k} |k|\,p_X(k) < \infty \qquad \int_{-\infty}^{\infty} |y|\,f_Y(y)\,dy < \infty \]

If not, we say that the random variable has no finite expected value. One immediate reason for requiring absolute convergence is that a convergent sum that is not absolutely convergent depends on the order in which the terms are added, and order should obviously not be a consideration when defining an average.

Example 3.5.1

Suppose \(X\) is a binomial random variable with \(p = \frac{5}{9}\) and \(n = 3\). Then \(p_X(k) = P(X = k) = \binom{3}{k}\left(\frac{5}{9}\right)^k\left(\frac{4}{9}\right)^{3-k}\), \(k = 0, 1, 2, 3\). What is the expected value of \(X\)?
Applying Definition 3.5.1 gives

\[ E(X) = \sum_{k=0}^{3} k \cdot \binom{3}{k}\left(\frac{5}{9}\right)^k\left(\frac{4}{9}\right)^{3-k} \]

\[ = (0)\left(\frac{64}{729}\right) + (1)\left(\frac{240}{729}\right) + (2)\left(\frac{300}{729}\right) + (3)\left(\frac{125}{729}\right) = \frac{1215}{729} = \frac{5}{3} = 3\left(\frac{5}{9}\right) \]

Comment Notice that the expected value here reduces to five-thirds, which can be written as three times five-ninths, the latter two factors being \(n\) and \(p\), respectively. As the next theorem proves, that relationship is not a coincidence.

Theorem 3.5.1

Suppose \(X\) is a binomial random variable with parameters \(n\) and \(p\). Then \(E(X) = np\).


Proof According to Definition 3.5.1, \(E(X)\) for a binomial random variable is the sum

\[ E(X) = \sum_{k=0}^{n} k \cdot p_X(k) = \sum_{k=0}^{n} k \binom{n}{k} p^k (1 - p)^{n-k} \]

\[ = \sum_{k=0}^{n} \frac{k \cdot n!}{k!\,(n - k)!}\, p^k (1 - p)^{n-k} \]

\[ = \sum_{k=1}^{n} \frac{n!}{(k - 1)!\,(n - k)!}\, p^k (1 - p)^{n-k} \tag{3.5.2} \]

At this point, a trick is called for. If \(E(X) = \sum_{\text{all } k} g(k)\) can be factored in such a way that \(E(X) = h \sum_{\text{all } k} p_{X^*}(k)\), where \(p_{X^*}(k)\) is the pdf for some random variable \(X^*\), then \(E(X) = h\), since the sum of a pdf over its entire range is 1. Here, suppose that \(np\) is factored out of Equation 3.5.2. Then

\[ E(X) = np \sum_{k=1}^{n} \frac{(n - 1)!}{(k - 1)!\,(n - k)!}\, p^{k-1} (1 - p)^{n-k} \]

\[ = np \sum_{k=1}^{n} \binom{n - 1}{k - 1} p^{k-1} (1 - p)^{n-k} \]
Now, let \(j = k - 1\). It follows that

\[ E(X) = np \sum_{j=0}^{n-1} \binom{n - 1}{j} p^j (1 - p)^{n-j-1} \]

Finally, letting \(m = n - 1\) gives

\[ E(X) = np \sum_{j=0}^{m} \binom{m}{j} p^j (1 - p)^{m-j} \]

and, since the value of the sum is 1 (why?),

\[ E(X) = np \tag{3.5.3} \]

Comment The statement of Theorem 3.5.1 should come as no surprise. If a multiple-choice test, for example, has one hundred questions, each with five possible answers, we would “expect” to get twenty correct, just by guessing. But if the random variable \(X\) denotes the number of correct answers (out of one hundred), \(20 = E(X) = 100\left(\frac{1}{5}\right) = np\).

Example 3.5.2

An urn contains nine chips, five red and four white. Three are drawn out at random without replacement. Let \(X\) denote the number of red chips in the sample. Find \(E(X)\).

From Section 3.2, we recognize \(X\) to be a hypergeometric random variable, where

\[ P(X = k) = p_X(k) = \frac{\binom{5}{k}\binom{4}{3-k}}{\binom{9}{3}}, \quad k = 0, 1, 2, 3 \]

Therefore,

\[ E(X) = \sum_{k=0}^{3} k \cdot \frac{\binom{5}{k}\binom{4}{3-k}}{\binom{9}{3}} = (0)\left(\frac{4}{84}\right) + (1)\left(\frac{30}{84}\right) + (2)\left(\frac{40}{84}\right) + (3)\left(\frac{10}{84}\right) = \frac{5}{3} \]

Comment As was true in Example 3.5.1, the value found here for \(E(X)\) suggests a general formula, in this case for the expected value of a hypergeometric random variable.

Theorem 3.5.2

Suppose \(X\) is a hypergeometric random variable with parameters \(r\), \(w\), and \(n\). That is, suppose an urn contains \(r\) red balls and \(w\) white balls. A sample of size \(n\) is drawn simultaneously from the urn. Let \(X\) be the number of red balls in the sample. Then

\[ E(X) = \frac{rn}{r + w} \]

Proof See Question 3.5.25.
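Both closed-form results, E(X) = np for the binomial and E(X) = rn/(r + w) for the hypergeometric, can be spot-checked by summing k times p_X(k) directly. The sketch below does this with exact fractions for the two worked examples and for the multiple-choice illustration; it is a brute-force check for particular parameter values, not a proof, and the function names are illustrative.

```python
from fractions import Fraction
from math import comb


def binomial_mean(n, p):
    """Sum of k * C(n, k) * p^k * (1 - p)^(n - k) for k = 0..n."""
    return sum(k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))


def hypergeometric_mean(r, w, n):
    """Sum of k * C(r, k) * C(w, n - k) / C(r + w, n) for k = 0..n."""
    total = comb(r + w, n)
    return sum(Fraction(k * comb(r, k) * comb(w, n - k), total) for k in range(n + 1))


# Example 3.5.1: binomial with n = 3, p = 5/9.
b = binomial_mean(3, Fraction(5, 9))
# Multiple-choice comment: n = 100, p = 1/5.
guessing = binomial_mean(100, Fraction(1, 5))
# Example 3.5.2: five red chips, four white chips, three drawn.
h = hypergeometric_mean(5, 4, 3)

print(f"binomial (3, 5/9): {b}   np = {3 * Fraction(5, 9)}")
print(f"binomial (100, 1/5): {guessing}")
print(f"hypergeometric (5, 4, 3): {h}   rn/(r + w) = {Fraction(5 * 3, 5 + 4)}")
```

Using `Fraction` keeps every term exact, so agreement with np and rn/(r + w) is an equality rather than a floating-point approximation.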


Comment Let \(p\) represent the proportion of red balls in an urn, that is, \(p = \frac{r}{r + w}\). The formula, then, for the expected value of a hypergeometric random variable has the same structure as the formula for the expected value of a binomial random variable:

\[ E(X) = \frac{rn}{r + w} = n\,\frac{r}{r + w} = np \]

Example 3.5.3

Among the more common versions of the “numbers” racket is a game called D.J., its name deriving from the fact that the winning ticket is determined from Dow Jones averages. Three sets of stocks are used: Industrials, Transportations, and Utilities. Traditionally, the three are quoted at two different times, 11 a.m. and noon. The last digits of the earlier quotation are arranged to form a three-digit number; the noon quotation generates a second three-digit number, formed the same way. Those two numbers are then added together and the last three digits of that sum become the winning pick. Figure 3.5.3 shows a set of quotations for which 906 would be declared the winner.

Figure 3.5.3

                   11 A.M. quotation    Noon quotation
Industrials             845.61              848.17
Transportation          375.27              376.73
Utilities               110.63              110.63
Last digits               173                 733

173 + 733 = 906 = Winning number

The payoff in D.J. is 700 to 1. Suppose that we bet $5. How much do we stand to win, or lose, on the average?

Let \(p\) denote the probability of our number being the winner and let \(X\) denote our earnings. Then

\[ X = \begin{cases} \$3500 & \text{with probability } p \\ -\$5 & \text{with probability } 1 - p \end{cases} \]

and

\[ E(X) = \$3500 \cdot p - \$5 \cdot (1 - p) \]

Our intuition would suggest (and this time it would be correct!) that each of the possible winning numbers, 000 through 999, is equally likely. That being the case, \(p = 1/1000\) and

\[ E(X) = \$3500 \cdot \left(\frac{1}{1000}\right) - \$5 \cdot \left(\frac{999}{1000}\right) = -\$1.50 \]

On the average, then, we lose $1.50 on a $5.00 bet.

Example 3.5.4

Suppose that fifty people are to be given a blood test to see who has a certain disease. The obvious laboratory procedure is to examine each person’s blood individually, meaning that fifty tests would eventually be run. An alternative strategy is to divide each person’s blood sample into two parts, say, A and B. All of the A’s would then be mixed together and treated as one sample. If that “pooled” sample proved to be negative for the disease, all fifty individuals must necessarily be free of the infection, and no further testing would need to be done. If the pooled sample gave a positive reading, of course, all fifty B samples would have to be analyzed separately. Under what conditions would it make sense for a laboratory to consider pooling the fifty samples?

In principle, the pooling strategy is preferable (i.e., more economical) if it can substantially reduce the number of tests that need to be performed. Whether or not it can do so depends ultimately on the probability \(p\) that a person is infected with the disease.

Let the random variable \(X\) denote the number of tests that will have to be performed if the samples are pooled. Clearly,

\[ X = \begin{cases} 1 & \text{if none of the fifty is infected} \\ 51 & \text{if at least one of the fifty is infected} \end{cases} \]

But

\[ P(X = 1) = p_X(1) = P(\text{None of the fifty is infected}) = (1 - p)^{50} \]

(assuming independence), and

\[ P(X = 51) = p_X(51) = 1 - P(X = 1) = 1 - (1 - p)^{50} \]

Therefore,

\[ E(X) = 1 \cdot (1 - p)^{50} + 51 \cdot [1 - (1 - p)^{50}] \]


Table 3.5.1 shows E(X) as a function of p. As our intuition would suggest, 
the pooling strategy becomes increasingly feasible as the prevalence of the disease 
diminishes. If the chance of a person being infected is 1 in 1000, for example, the 
pooling strategy requires an average of only 3.4 tests, a dramatic improvement over 
the fifty tests that would be needed if the samples were tested one by one. On the 
other hand, if 1 in 10 individuals is infected, pooling would be clearly inappropriate, 
requiring more than fifty tests [E(X) = 50.7]. 


Table 3.5.1

p          E(X)
0.5        51.0
0.1        50.7
0.01       20.8
0.001       3.4
0.0001      1.2
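The pooled-testing formula is easy to tabulate for other infection rates or pool sizes. The sketch below evaluates E(X) = 1(1 - p)^50 + 51[1 - (1 - p)^50] under the same independence assumption as the example; the function name and the `group_size` parameter are illustrative additions. Computed values should agree with Table 3.5.1 to within roughly 0.1, the rounding used in the table.

```python
def expected_tests(p, group_size=50):
    """Expected number of tests when one pooled test is followed, if positive,
    by individual tests on everyone in the group (independence assumed)."""
    none_infected = (1 - p) ** group_size
    return 1 * none_infected + (group_size + 1) * (1 - none_infected)


for p in (0.5, 0.1, 0.01, 0.001, 0.0001):
    saving = 50 - expected_tests(p)
    print(f"p = {p:<7} E(X) = {expected_tests(p):6.2f}   tests saved vs. 50: {saving:6.2f}")
```

Pooling pays off only when the expected count falls below the fifty individual tests, which happens once p is small enough that a fully negative pool becomes reasonably likely.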


Example 3.5.5

Consider the following game. A fair coin is flipped until the first tail appears; we win $2 if it appears on the first toss, $4 if it appears on the second toss, and, in general, \(\$2^k\) if it first occurs on the \(k\)th toss. Let the random variable \(X\) denote our winnings. How much should we have to pay in order for this to be a fair game? [Note: A fair game is one where the difference between the ante and \(E(X)\) is 0.]

Known as the St. Petersburg paradox, this problem has a rather unusual answer. First, note that

\[ p_X(2^k) = P(X = 2^k) = \frac{1}{2^k}, \quad k = 1, 2, \ldots \]

Therefore,

\[ E(X) = \sum_{\text{all } k} 2^k\,p_X(2^k) = \sum_{k=1}^{\infty} 2^k \cdot \frac{1}{2^k} = 1 + 1 + 1 + \cdots \]

which is a divergent sum. That is, \(X\) does not have a finite expected value, so in order for this game to be fair, our ante would have to be an infinite amount of money!

Comment Mathematicians have been trying to “explain” the St. Petersburg paradox for almost two hundred years (56). The answer seems clearly absurd (no gambler would consider paying even $25 to play such a game, much less an infinite amount), yet the computations involved in showing that \(X\) has no finite expected value are unassailably correct. Where the difficulty lies, according to one common theory, is with our inability to put in perspective the very small probabilities of winning very large payoffs. Furthermore, the problem assumes that our opponent has infinite capital, which is an impossible state of affairs. We get a much more reasonable answer for \(E(X)\) if the stipulation is added that our winnings can be at most, say, $1000 (see Question 3.5.19) or if the payoffs are assigned according to some formula other than \(2^k\) (see Question 3.5.20).

Comment There are two important lessons to be learned from the St. Petersburg paradox. First is the realization that \(E(X)\) is not necessarily a meaningful characterization of the “location” of a distribution. Question 3.5.24 shows another situation where the formal computation of \(E(X)\) gives a similarly inappropriate answer. Second, we need to be aware that the notion of expected value is not necessarily synonymous with the concept of worth. Just because a game, for example, has a positive expected value (even a very large positive expected value) does not imply that someone would want to play it. Suppose, for example, that you had the opportunity to spend your last $10,000 on a sweepstakes ticket where the prize was $1 billion but the probability of winning was only one in ten thousand. The expected value of such a bet would be over $90,000,

\[ E(X) = \$1{,}000{,}000{,}000\left(\frac{1}{10{,}000}\right) + (-\$10{,}000)\left(\frac{9{,}999}{10{,}000}\right) = \$90{,}001 \]

but it is doubtful that many people would rush out to buy a ticket. (Economists have long recognized the distinction between a payoff’s numerical value and its perceived desirability. They refer to the latter as utility.)

Example 3.5.6

The distance, \(Y\), that a molecule in a gas travels before colliding with another molecule can be modeled by the exponential pdf

\[ f_Y(y) = \frac{1}{\mu} e^{-y/\mu}, \quad y > 0 \]

where \(\mu\) is a positive constant known as the mean free path. Find \(E(Y)\).
Since the random variable here is continuous, its expected value is an integral:

\[ E(Y) = \int_0^{\infty} y\,\frac{1}{\mu}\,e^{-y/\mu}\,dy \]

Let \(w = y/\mu\), so that \(dw = (1/\mu)\,dy\). Then \(E(Y) = \mu \int_0^{\infty} w e^{-w}\,dw\). Setting \(u = w\) and \(dv = e^{-w}\,dw\) and integrating by parts gives

\[ E(Y) = \mu\left[-w e^{-w} - e^{-w}\right]_0^{\infty} = \mu \tag{3.5.4} \]

Equation 3.5.4 shows that \(\mu\) is aptly named: it does, in fact, represent the average distance a molecule travels, free of any collisions. Nitrogen (N2), for example, at room temperature and standard atmospheric pressure has \(\mu = 0.00005\) cm. An N2 molecule, then, travels that far before colliding with another N2 molecule, on the average.

Example 3.5.7

One continuous pdf that has a number of interesting applications in physics is the 

Rayleigh distribution, where the pdf is given by 

froy= cs ev” a>0; 0< y<oo (3.5.5) 
a 


Calculate the expected value for a random variable having a Rayleigh distribution. 
From Definition 3.5.1, 


me y 2 1992 
EW)= | ye "dy 
0 a 


Let v= y/(/2a). Then 
E() =2v2a [ ve’ du 
0 


The integrand here is a special case of the general form ve” Fork=1, 


[o.@) 5 [o.@) 4 1 
/ ve’ du =) ve” dv=—J/x 
0 0 4 


Therefore, 
1 
E(Y)=2V2a- yvm 
= aap i 


Comment The pdf here is named for John William Strutt, Baron Rayleigh, the 
nineteenth- and twentieth-century British physicist who showed that Equation 3.5.5 
is the solution to a problem arising in the study of wave motion. If two waves are 
superimposed, it is well known that the height of the resultant at any time t is sim- 
ply the algebraic sum of the corresponding heights of the waves being added (see 
Figure 3.5.4). Seeking to extend that notion, Rayleigh posed the following question: 
If n waves, each having the same amplitude h and the same wavelength, are superimposed
randomly with respect to phase, what can we say about the amplitude R of
the resultant? Clearly, R is a random variable, its value depending on the particular
collection of phase angles represented by the sample. What Rayleigh was able to
show in his 1880 paper (166) is that when n is large, the probabilistic behavior of R
is described by the pdf

$$f_R(r) = \frac{2r}{nh^2} e^{-r^2/(nh^2)}, \quad r > 0$$

which is just a special case of Equation 3.5.5 with $a = \sqrt{nh^2/2}$.


Figure 3.5.4 


Example 
A Second Measure of Central Tendency: The Median


While the expected value is the most frequently used measure of a random vari- 
able’s central tendency, it does have a weakness that sometimes makes it misleading 
and inappropriate. Specifically, if one or several possible values of a random variable
are either much smaller or much larger than all the others, the value of $\mu$ can
be distorted in the sense that it no longer reflects the center of the distribution in 
any meaningful way. For example, suppose a small community consists of a homo- 
geneous group of middle-range salary earners, and then Bill Gates moves to town. 
Obviously, the town’s average salary before and after the multibillionaire arrives 
will be quite different, even though he represents only one new value of the “salary” 
random variable. 

It would be helpful to have a measure of central tendency that is not so sensitive 
to “outliers” or to probability distributions that are markedly skewed. One such 
measure is the median, which, in effect, divides the area under a pdf into two equal 
areas. 


Definition 3.5.2. If X is a discrete random variable, the median, m, is that point
for which $P(X < m) = P(X > m)$. In the event that $P(X \le m) = 0.5$ and
$P(X \ge m') = 0.5$, the median is defined to be the arithmetic average, $(m + m')/2$.

If Y is a continuous random variable, its median is the solution to the
integral equation $\int_{-\infty}^{m} f_Y(y) \, dy = 0.5$.


If a random variable’s pdf is symmetric, $\mu$ and m will be equal. Should $p_X(k)$ or
$f_Y(y)$ not be symmetric, though, the difference between the expected value and the
median can be considerable, especially if the asymmetry takes the form of extreme 
skewness. The situation described here is a case in point. 

Soft-glow makes a 60-watt light bulb that is advertised to have an average life 
of one thousand hours. Assuming that the performance claim is valid, is it rea- 
sonable for consumers to conclude that the Soft-glow bulbs they buy will last for 
approximately one thousand hours? 

No! If the average life of a bulb is one thousand hours, the (continuous) pdf, 
fy (y), modeling the length of time, Y, that it remains lit before burning out is likely 
to have the form 


$$f_Y(y) = 0.001 e^{-0.001y}, \quad y > 0 \qquad (3.5.6)$$

(for reasons explained in Chapter 4). But Equation 3.5.6 is a very skewed pdf, having
a shape much like the curve drawn in Figure 3.4.8. The median for such a distribution
will lie considerably to the left of the mean.

More specifically, the median lifetime for these bulbs, according to Definition 3.5.2,
is the value m for which

$$\int_0^m 0.001 e^{-0.001y} \, dy = 0.5$$

But $\int_0^m 0.001 e^{-0.001y} \, dy = 1 - e^{-0.001m}$. Setting the latter equal to 0.5 implies that

$$m = (1/{-0.001}) \ln(0.5) = 693$$

So, even though the average life of one of these bulbs is 1000 hours, there is a 50%
chance that the one you buy will last less than 693 hours.
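
The median calculation can be reproduced numerically. The sketch below solves $F_Y(m) = 0.5$ by bisection for the light bulb pdf and compares the answer with the closed form $\ln 2 / 0.001$. The function names, the search interval, and the tolerance are assumptions made for illustration.

```python
import math

rate = 0.001  # pdf is rate * exp(-rate * y), so the mean is 1 / rate = 1000 hours

def cdf(y):
    """P(Y <= y) for the exponential pdf of Equation 3.5.6."""
    return 1.0 - math.exp(-rate * y)

def median_by_bisection(cdf, lo, hi, tol=1e-9):
    """Find m with cdf(m) = 0.5, assuming cdf is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

median_numeric = median_by_bisection(cdf, 0.0, 10_000.0)
median_closed_form = math.log(2) / rate
print(round(median_numeric), round(median_closed_form), round(1 / rate))
```

The same bisection routine would work for any continuous cdf for which a closed-form median is not available.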


Questions 


3.5.1. Recall the game of Keno described in Ques- 
tion 3.2.26. The following are all the payoffs on a $1 
wager where the player has bet on ten numbers. Calculate 
E(X), where the random variable X denotes the amount 
of money won. 


Number of Correct Guesses    Payoff     Probability
< 5                          -$1        .935
5                            2          .0514
6                            18         .0115
7                            180        .0016
8                            1,300      1.35 x 10^-4
9                            2,600      6.12 x 10^-6
10                           10,000     1.12 x 10^-7


3.5.2. The roulette wheels in Monte Carlo typically have 
a 0 but not a 00. What is the expected value of betting on 
red in this case? If a trip to Monte Carlo costs $3000, how 
much would a player have to bet to justify gambling there 
rather than Las Vegas? 


3.5.3. The pdf describing the daily profit, X, earned by 
Acme Industries was derived in Example 3.3.7. Find the 
company’s average daily profit. 


3.5.4. In the game of redball, two drawings are made 
without replacement from a bowl that has four white ping- 
pong balls and two red ping-pong balls. The amount won is 
determined by how many of the red balls are selected. For 
a $5 bet, a player can opt to be paid under either Rule A 
or Rule B, as shown. If you were playing the game, which 
would you choose? Why? 


A B 
No. of Red No. of Red 
Balls Drawn Payoff Balls Drawn Payoff 
0 0 0 0 
1 $2 1 $1 
2 $10 2 $20 


3.5.5. Suppose a life insurance company sells a $50,000, 
five-year term policy to a twenty-five-year-old woman. At 
the beginning of each year the woman is alive, the com- 
pany collects a premium of $P. The probability that the 
woman dies and the company pays the $50,000 is given 
in the table below. So, for example, in Year 3, the com- 
pany loses $50,000 — $P with probability 0.00054 and gains 
$P with probability 1 — 0.00054 = 0.99946. If the company 
expects to make $1000 on this policy, what should P be? 


Year Probability of Payoff 
1 0.00051 
2 0.00052 
3 0.00054 
4 0.00056 
5 0.00059 


3.5.6. A manufacturer has one hundred memory chips 
in stock, 4% of which are likely to be defective (based 
on past experience). A random sample of twenty chips is 
selected and shipped to a factory that assembles laptops. 
Let X denote the number of computers that receive faulty 
memory chips. Find E(X). 

3.5.7. Records show that 642 new students have just 
entered a certain Florida school district. Of those 642, a 
total of 125 are not adequately vaccinated. The district’s 
physician has scheduled a day for students to receive what- 
ever shots they might need. On any given day, though, 
12% of the district’s students are likely to be absent. 
How many new students, then, can be expected to remain 
inadequately vaccinated? 


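3.5.8. Calculate E(Y) for the following pdfs:

(a) $f_Y(y) = 3(1 - y)^2, \quad 0 \le y \le 1$
(b) $f_Y(y) = 4ye^{-2y}, \quad y \ge 0$
(c) $f_Y(y) = \begin{cases} \frac{3}{4}, & 0 \le y \le 1 \\ \frac{1}{4}, & 2 \le y \le 3 \\ 0, & \text{elsewhere} \end{cases}$
(d) $f_Y(y) = \sin y, \quad 0 \le y \le \frac{\pi}{2}$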

3.5.9. Recall Question 3.4.4, where the length of time Y
(in years) that a malaria patient spends in remission has
pdf $f_Y(y) = \frac{1}{9} y^2$, $0 \le y \le 3$. What is the average length of
time that such a patient spends in remission?


3.5.10. Let the random variable Y have the uniform distribution
over [a, b]; that is, $f_Y(y) = \frac{1}{b - a}$ for $a \le y \le b$.
Find E(Y) using Definition 3.5.1. Also, deduce the value
of E(Y), knowing that the expected value is the center of
gravity of $f_Y(y)$.


3.5.11. Show that the expected value associated with the
exponential distribution, $f_Y(y) = \lambda e^{-\lambda y}$, $y > 0$, is $1/\lambda$, where
$\lambda$ is a positive constant.


3.5.12. Show that 


is a valid pdf but that Y does not have a finite expected 
value. 


3.5.13. Based on recent experience, ten-year-old passen- 
ger cars going through a motor vehicle inspection station 
have an 80% chance of passing the emissions test. Suppose 
that two hundred such cars will be checked out next week. 
Write two formulas that show the number of cars that are 
expected to pass. 


3.5.14. Suppose that fifteen observations are chosen at
random from the pdf $f_Y(y) = 3y^2$, $0 \le y \le 1$. Let X denote
the number that lie in the interval $(\frac{1}{2}, 1)$. Find E(X).


3.5.15. A city has 74,806 registered automobiles. Each is 
required to display a bumper decal showing that the owner 
paid an annual wheel tax of $50. By law, new decals need 
to be purchased during the month of the owner’s birth- 
day. How much wheel tax revenue can the city expect to 
receive in November? 


3.5.16. Regulators have found that twenty-three of the 
sixty-eight investment companies that filed for bankruptcy 
in the past five years failed because of fraud, not for rea- 
sons related to the economy. Suppose that nine additional 
firms will be added to the bankruptcy rolls during the 
next quarter. How many of those failures are likely to be 
attributed to fraud? 


3.5.17. An urn contains four chips numbered 1 through 4. 
Two are drawn without replacement. Let the random 
variable X denote the larger of the two. Find E(X).


3.5.18. A fair coin is tossed three times. Let the random 
variable X denote the total number of heads that appear 
times the number of heads that appear on the first and 
third tosses. Find E(X). 


3.5.19. How much would you have to ante to make the
St. Petersburg game “fair” (recall Example 3.5.5) if the
most you could win was $1000? That is, the payoffs are $2^k
for 1 ≤ k ≤ 9, and $1000 for k ≥ 10.


3.5.20. For the St. Petersburg problem (Example 3.5.5), 
find the expected payoff if 


(a) the amounts won are $c^k$ instead of $2^k$, where $0 < c < 2$.

(b) the amounts won are $\log 2^k$. [This was a modification
suggested by D. Bernoulli (a nephew of
James Bernoulli) to take into account the decreasing
marginal utility of money: the more you have, the
less useful a bit more is.]


3.5.21. A fair die is rolled three times. Let X denote 
the number of different faces showing, X = 1,2,3. 
Find E(X). 


3.5.22. Two distinct integers are chosen at random from 
the first five positive integers. Compute the expected 
value of the absolute value of the difference of the two 
numbers. 


3.5.23. Suppose that two evenly matched teams are play- 
ing in the World Series. On the average, how many games 
will be played? (The winner is the first team to get four 
victories.) Assume that each game is an independent 
event. 


3.5.24. An urn contains one white chip and one black 
chip. A chip is drawn at random. If it is white, the “game” 
is over; if it is black, that chip and another black one are 
put into the urn. Then another chip is drawn at random 
from the “new” urn and the same rules for ending or con- 
tinuing the game are followed (i.e., if the chip is white, the 
game is over; if the chip is black, it is placed back in the 
urn, together with another chip of the same color). The 
drawings continue until a white chip is selected. Show that 
the expected number of drawings necessary to get a white 
chip is not finite. 


3.5.25. A random sample of size n is drawn without 
replacement from an urn containing r red chips and w 
white chips. Define the random variable X to be the num- 
ber of red chips in the sample. Use the summation tech- 
nique described in Theorem 3.5.1 to prove that E(X) = 
rn/(r+w). 


3.5.26. Given that X is a nonnegative, integer-valued
random variable, show that

$$E(X) = \sum_{k=1}^{\infty} P(X \ge k)$$


3.5.27. Find the median for each of the following pdfs:

(a) $f_Y(y) = (\theta + 1) y^{\theta}$, $0 \le y \le 1$, where $\theta > 0$
(b) $f_Y(y) = y + \frac{1}{2}$, $0 \le y \le 1$


The Expected Value of a Function of a Random Variable


There are many situations that call for finding the expected value of a function of a
random variable, say, Y = g(X). One common example would be change of scale
problems, where g(X) = aX + b for constants a and b. Sometimes the pdf of the
new random variable Y can be easily determined, in which case E(Y) can be calculated
by simply applying Definition 3.5.1. Often, though, $f_Y(y)$ can be difficult to
derive, depending on the complexity of g(X). Fortunately, Theorem 3.5.3 allows us
to calculate the expected value of Y without knowing the pdf for Y.


Theorem 3.5.3
Suppose X is a discrete random variable with pdf $p_X(k)$. Let g(X) be a function of X.
Then the expected value of the random variable g(X) is given by

$$E[g(X)] = \sum_{\text{all } k} g(k) \cdot p_X(k)$$

provided that $\sum_{\text{all } k} |g(k)| \, p_X(k) < \infty$.

If Y is a continuous random variable with pdf $f_Y(y)$, and if g(Y) is a continuous
function, then the expected value of the random variable g(Y) is

$$E[g(Y)] = \int_{-\infty}^{\infty} g(y) \cdot f_Y(y) \, dy$$

provided that $\int_{-\infty}^{\infty} |g(y)| \, f_Y(y) \, dy < \infty$.


Proof We will prove the result for the discrete case. See (146) for details showing
how the argument is modified when the pdf is continuous. Let W = g(X). The set of
all possible k values, $k_1, k_2, \ldots$, will give rise to a set of w values, $w_1, w_2, \ldots$, where, in
general, more than one k may be associated with a given w. Let $S_j$ be the set of k’s
for which $g(k) = w_j$ [so $\cup_j S_j$ is the entire set of k values for which $p_X(k)$ is defined].
We obviously have that $P(W = w_j) = P(X \in S_j)$, and we can write

$$E(W) = \sum_j w_j \cdot P(W = w_j) = \sum_j w_j \cdot P(X \in S_j)$$

$$= \sum_j w_j \sum_{k \in S_j} p_X(k)$$

$$= \sum_j \sum_{k \in S_j} w_j \cdot p_X(k)$$

$$= \sum_j \sum_{k \in S_j} g(k) \, p_X(k) \quad \text{(why?)}$$

$$= \sum_{\text{all } k} g(k) \, p_X(k)$$

Since it is being assumed that $\sum_{\text{all } k} |g(k)| \, p_X(k) < \infty$, the statement of the theorem
holds.


Corollary 3.5.1
For any random variable W, E(aW + b) = aE(W) + b, where a and b are constants.


Proof Suppose W is continuous; the proof for the discrete case is similar. By Theorem 3.5.3,
$E(aW + b) = \int_{-\infty}^{\infty} (aw + b) f_W(w) \, dw$, but the latter can be written

$$a \int_{-\infty}^{\infty} w \cdot f_W(w) \, dw + b \int_{-\infty}^{\infty} f_W(w) \, dw = aE(W) + b \cdot 1 = aE(W) + b$$


Example 
3.5.9 


Example 
3.5.10 
Suppose that X is a random variable whose pdf is nonzero only for the three values
-2, 1, and +2:

k      p_X(k)
-2     5/8
1      1/8
2      2/8

Let $W = g(X) = X^2$. Verify the statement of Theorem 3.5.3 by computing E(W)
two ways: first, by finding $p_W(w)$ and summing $w \cdot p_W(w)$ over w and, second, by
summing $g(k) \cdot p_X(k)$ over k.

By inspection, the pdf for W is defined for only two values, 1 and 4:

w (= k^2)    p_W(w)
1            1/8
4            7/8

Taking the first approach to find E(W) gives

$$E(W) = \sum_w w \cdot p_W(w) = 1 \cdot \left(\frac{1}{8}\right) + 4 \cdot \left(\frac{7}{8}\right) = \frac{29}{8}$$

To find the expected value via Theorem 3.5.3, we take

$$E[g(X)] = \sum_k k^2 \cdot p_X(k) = (-2)^2 \cdot \frac{5}{8} + (1)^2 \cdot \frac{1}{8} + (2)^2 \cdot \frac{2}{8}$$

with the sum here reducing to the answer we already found, $\frac{29}{8}$.

For this particular situation, neither approach was easier than the other. In general,
that will not be the case. Finding $p_W(w)$ is often quite difficult, and on those
occasions Theorem 3.5.3 can be of great benefit.
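
The two routes to E(W) in this example can be mirrored in a few lines of exact arithmetic. This is a minimal sketch; the dictionary layout and variable names are assumptions, while the probabilities 5/8, 1/8, and 2/8 are those of the example.

```python
from collections import defaultdict
from fractions import Fraction

# pdf of X from the example: nonzero only at -2, 1, and 2
p_X = {-2: Fraction(5, 8), 1: Fraction(1, 8), 2: Fraction(2, 8)}

def g(k):
    return k ** 2

# First approach: build the pdf of W = g(X), then sum w * p_W(w).
p_W = defaultdict(Fraction)
for k, prob in p_X.items():
    p_W[g(k)] += prob          # k values with the same g(k) pool their probability
mean_via_p_W = sum(w * prob for w, prob in p_W.items())

# Second approach (Theorem 3.5.3): sum g(k) * p_X(k) without ever forming p_W.
mean_via_theorem = sum(g(k) * prob for k, prob in p_X.items())

print(dict(p_W), mean_via_p_W, mean_via_theorem)
```

Both sums give 29/8. The second loop never needs to know which k values collapse onto the same w, which is the practical advantage the theorem offers.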


Suppose the amount of propellant, Y, put into a can of spray paint is a random
variable with pdf

$$f_Y(y) = 3y^2, \quad 0 \le y \le 1$$

Experience has shown that the largest surface area that can be painted by a can
having Y amount of propellant is twenty times the area of a circle generated by a
radius of Y ft. If the Purple Dominoes, a newly formed urban gang, have just stolen
their first can of spray paint, can they expect to have enough to cover a 5' x 8' subway
panel with graffiti?

No. By assumption, the maximum area (in ft^2) that can be covered by a can of
paint is described by the function

$$g(Y) = 20\pi Y^2$$

According to the second statement in Theorem 3.5.3, though, the average value for
g(Y) is slightly less than the desired 40 ft^2:

$$E[g(Y)] = \int_0^1 20\pi y^2 \cdot 3y^2 \, dy = \frac{60\pi y^5}{5} \Big|_0^1 = 12\pi \approx 37.7 \text{ ft}^2$$
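
Example 3.5.11
A fair coin is tossed until a head appears. You will be given $(\frac{1}{2})^k$ dollars if that first
head occurs on the kth toss. How much money can you expect to be paid?

Let the random variable X denote the toss at which the first head appears. Then

$$p_X(k) = P(X = k) = P(\text{first } k - 1 \text{ tosses are tails and } k\text{th toss is a head}) = \left(\frac{1}{2}\right)^{k-1} \cdot \frac{1}{2} = \left(\frac{1}{2}\right)^k, \quad k = 1, 2, \ldots$$

Moreover,

$$E(\text{amount won}) = E\left[\left(\frac{1}{2}\right)^X\right] = E[g(X)] = \sum_{\text{all } k} g(k) \cdot p_X(k) = \sum_{k=1}^{\infty} \left(\frac{1}{2}\right)^k \left(\frac{1}{2}\right)^k = \sum_{k=1}^{\infty} \left(\frac{1}{4}\right)^k = \frac{1/4}{1 - 1/4} = \frac{1}{3}$$

That is, the expected payment is about $0.33.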
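
For the coin game paying $(\frac{1}{2})^k$ dollars when the first head occurs on toss k, the expected payoff is the geometric series $\sum_k (\frac{1}{4})^k$. The sketch below, with a hypothetical function name and truncation points, shows the partial sums approaching 1/3.

```python
from fractions import Fraction

def expected_payoff(max_tosses):
    """Partial sum of g(k) * p_X(k) with g(k) = (1/2)^k and p_X(k) = (1/2)^k."""
    half = Fraction(1, 2)
    return sum(half**k * half**k for k in range(1, max_tosses + 1))

for n in (1, 2, 5, 20):
    print(n, float(expected_payoff(n)))
```

Each partial sum equals $\frac{1}{3}(1 - 4^{-n})$, so the gap to 1/3 shrinks by a factor of four with every additional toss included.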


Example 3.5.12
In one of the early applications of probability to physics, James Clerk Maxwell
(1831-1879) showed that the speed S of a molecule in a perfect gas has a density
function given by

$$f_S(s) = 4\sqrt{\frac{a^3}{\pi}} \, s^2 e^{-as^2}, \quad s > 0$$

where a is a constant depending on the temperature of the gas and the mass of the
particle. What is the average energy of a molecule in a perfect gas?

Let m denote the molecule’s mass. Recall from physics that energy (W), mass
(m), and speed (S) are related through the equation

$$W = \frac{1}{2} m S^2 = g(S)$$

To find E(W) we appeal to the second part of Theorem 3.5.3:

$$E(W) = \int_0^\infty g(s) f_S(s) \, ds = \int_0^\infty \frac{1}{2} m s^2 \cdot 4\sqrt{\frac{a^3}{\pi}} \, s^2 e^{-as^2} \, ds = 2m\sqrt{\frac{a^3}{\pi}} \int_0^\infty s^4 e^{-as^2} \, ds$$

We make the substitution $t = as^2$. Then

$$E(W) = \frac{m}{a\sqrt{\pi}} \int_0^\infty t^{3/2} e^{-t} \, dt$$

But

$$\int_0^\infty t^{3/2} e^{-t} \, dt = \left(\frac{3}{2}\right)\left(\frac{1}{2}\right)\sqrt{\pi} \quad \text{(see Section 4.4.6)}$$

so

$$E(\text{energy}) = E(W) = \frac{m}{a\sqrt{\pi}} \left(\frac{3}{2}\right)\left(\frac{1}{2}\right)\sqrt{\pi} = \frac{3m}{4a}$$


Example 3.5.13


Consolidated Industries is planning to market a new product and they are trying to
decide how many to manufacture. They estimate that each item sold will return a
profit of m dollars; each one not sold represents an n-dollar loss. Furthermore, they
suspect the demand for the product, V, will have an exponential distribution,

$$f_V(v) = \frac{1}{\lambda} e^{-v/\lambda}, \quad v > 0$$

How many items should the company produce if they want to maximize their
expected profit? (Assume that n, m, and $\lambda$ are known.)

If a total of x items are made, the company’s profit can be expressed as a
function Q(v), where

$$Q(v) = \begin{cases} mv - n(x - v) & \text{if } v < x \\ mx & \text{if } v \ge x \end{cases}$$

and v is the number of items sold. It follows that their expected profit is

$$E[Q(V)] = \int_0^\infty Q(v) \cdot f_V(v) \, dv$$

$$= \int_0^x [(m + n)v - nx] \left(\frac{1}{\lambda}\right) e^{-v/\lambda} \, dv + \int_x^\infty mx \left(\frac{1}{\lambda}\right) e^{-v/\lambda} \, dv \qquad (3.5.7)$$

The integration here is straightforward, though a bit tedious. Equation 3.5.7 eventually
simplifies to

$$E[Q(V)] = \lambda \cdot (m + n) - \lambda \cdot (m + n) e^{-x/\lambda} - nx$$

To find the optimal production level, we need to solve dE[Q(V)]/dx = 0 for x. But

$$\frac{dE[Q(V)]}{dx} = (m + n) e^{-x/\lambda} - n$$

and the latter equals zero when

$$x = -\lambda \cdot \ln\left(\frac{n}{m + n}\right)$$


Example 3.5.14
A point, y, is selected at random from the interval [0, 1], dividing the line into two
segments (see Figure 3.5.5). What is the expected value of the ratio of the shorter
segment to the longer segment?


Figure 3.5.5


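Notice, first, that the function

$$g(Y) = \frac{\text{shorter segment}}{\text{longer segment}}$$

has two expressions, depending on the location of the chosen point:

$$g(Y) = \begin{cases} y/(1 - y), & 0 \le y \le \frac{1}{2} \\ (1 - y)/y, & \frac{1}{2} < y \le 1 \end{cases}$$

By assumption, $f_Y(y) = 1$, $0 \le y \le 1$, so

$$E[g(Y)] = \int_0^{1/2} \frac{y}{1 - y} \cdot 1 \, dy + \int_{1/2}^1 \frac{1 - y}{y} \cdot 1 \, dy$$

Writing the second integrand as (1/y - 1) gives

$$\int_{1/2}^1 \frac{1 - y}{y} \cdot 1 \, dy = \int_{1/2}^1 \left(\frac{1}{y} - 1\right) dy = (\ln y - y) \Big|_{1/2}^1 = \ln 2 - \frac{1}{2}$$

By symmetry, though, the two integrals are the same, so

$$E\left(\frac{\text{shorter segment}}{\text{longer segment}}\right) = 2 \ln 2 - 1 = 0.39$$

On the average, then, the longer segment will be a little more than 2 1/2 times the
length of the shorter segment.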
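
The value $2 \ln 2 - 1$ for the expected shorter-to-longer ratio can be checked by integrating the ratio directly against the uniform pdf. The sketch below uses a midpoint rule; the function names and the number of subintervals are illustrative assumptions.

```python
import math

def ratio(y):
    """Shorter segment divided by longer segment when [0, 1] is cut at y."""
    shorter, longer = min(y, 1 - y), max(y, 1 - y)
    return shorter / longer

def midpoint_mean(g, n=100_000):
    """Approximate E[g(Y)] for Y uniform on [0, 1] with the midpoint rule."""
    h = 1.0 / n
    return sum(g((i + 0.5) * h) for i in range(n)) * h

estimate = midpoint_mean(ratio)
exact = 2 * math.log(2) - 1
print(round(estimate, 4), round(exact, 4))
```

Using min and max avoids writing the two-case formula for g explicitly, which is why no split at y = 1/2 appears in the code.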


Questions 


3.5.28. Suppose X is a binomial random variable with 
n= 10and p= z, What is the expected value of 3X — 4? 


3.5.29. A typical day’s production of a certain electronic 
component is twelve. The probability that one of these 
components needs rework is 0.11. Each component need- 
ing rework costs $100. What is the average daily cost for 
defective components? 


3.5.30. Let Y have probability density function

$$f_Y(y) = 2(1 - y), \quad 0 \le y \le 1$$

Suppose that $W = Y^2$, in which case

$$f_W(w) = \frac{1}{\sqrt{w}} - 1, \quad 0 \le w \le 1$$

Find E(W) in two different ways.


3.5.31. A tool and die company makes castings for steel 
stress-monitoring gauges. Their annual profit, Q, in hun- 
dreds of thousands of dollars, can be expressed as a 
function of product demand, y: 


O(y) =2(1—-e”) 
Suppose that the demand (in thousands) for their castings
follows an exponential pdf, $f_Y(y) = 6e^{-6y}$, $y > 0$. Find the
company’s expected profit.


3.5.32. A box is to be constructed so that its height is
five inches and its base is Y inches by Y inches, where
Y is a random variable described by the pdf $f_Y(y) = 6y(1 - y)$,
$0 \le y \le 1$. Find the expected volume of the box.


3.5.33. Grades on the last Economics 301 exam were not
very good. Graphed, their distribution had a shape similar
to the pdf

$$f_Y(y) = \frac{1}{5000}(100 - y), \quad 0 \le y \le 100$$

As a way of “curving” the results, the professor announces
that he will replace each person’s grade, Y, with a
new grade, g(Y), where $g(Y) = 10\sqrt{Y}$. Will the professor’s
strategy be successful in raising the class average
above 60?


3.5.34. If Y has probability density function

$$f_Y(y) = 2y, \quad 0 \le y \le 1$$

then $E(Y) = \frac{2}{3}$. Define the random variable W to be the
squared deviation of Y from its mean, that is, $W = (Y - \frac{2}{3})^2$.
Find E(W).


3.5.35. The hypotenuse, Y, of the isosceles right triangle 
shown is a random variable having a uniform pdf over 
the interval [6, 10]. Calculate the expected value of the 
triangle’s area. Do not leave the answer as a function 
of a. 


3.5.36. An urn contains n chips numbered 1 through n.
Assume that the probability of choosing chip i is equal
to ki, i = 1, 2, ..., n. If one chip is drawn, calculate $E(\frac{1}{X})$,
where the random variable X denotes the number showing
on the chip selected. [Hint: Recall that the sum of the
first n integers is n(n + 1)/2.]


3.6 The Variance


We saw in Section 3.5 that the location of a distribution is an important characteristic
and that it can be effectively measured by calculating either the mean or the median. 
A second feature of a distribution that warrants further scrutiny is its dispersion— 
that is, the extent to which its values are spread out. The two properties are totally 
different: Knowing a pdf’s location tells us absolutely nothing about its dispersion. 
Table 3.6.1, for example, shows two simple discrete pdfs with the same expected 
value (equal to zero), but with vastly different dispersions. 


Table 3.6.1 


k       p_X1(k)      k             p_X2(k)
-1      1/2          -1,000,000    1/2
1       1/2          1,000,000     1/2

It is not immediately obvious how the dispersion in a pdf should be quantified.
Suppose that X is any discrete random variable. One seemingly reasonable approach
would be to average the deviations of X from its mean, that is, calculate the
expected value of $X - \mu$. As it happens, that strategy will not work because the
negative deviations will exactly cancel the positive deviations, making the numerical
value of such an average always zero, regardless of the amount of spread present in
$p_X(k)$:

$$E(X - \mu) = E(X) - \mu = \mu - \mu = 0 \qquad (3.6.1)$$

Another possibility would be to modify Equation 3.6.1 by making all the deviations
positive, that is, to replace $E(X - \mu)$ with $E(|X - \mu|)$. This does work, and
it is sometimes used to measure dispersion, but the absolute value is somewhat
troublesome mathematically: It does not have a simple arithmetic formula, nor is
it a differentiable function. Squaring the deviations proves to be a much better
approach.


Definition 3.6.1. The variance of a random variable is the expected value of its
squared deviations from $\mu$. If X is discrete, with pdf $p_X(k)$,

$$\text{Var}(X) = \sigma^2 = E[(X - \mu)^2] = \sum_{\text{all } k} (k - \mu)^2 \cdot p_X(k)$$

If Y is continuous, with pdf $f_Y(y)$,

$$\text{Var}(Y) = \sigma^2 = E[(Y - \mu)^2] = \int_{-\infty}^{\infty} (y - \mu)^2 \cdot f_Y(y) \, dy$$

[If $E(X^2)$ or $E(Y^2)$ is not finite, the variance is not defined.]
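
To make the contrast in Table 3.6.1 concrete, the sketch below applies the discrete form of the definition to two pdfs that each put probability 1/2 on a pair of symmetric points (the reading of the table assumed here). The function names are illustrative, not from the text.

```python
from fractions import Fraction

def mean(pdf):
    """E(X) for a discrete pdf given as {value: probability}."""
    return sum(k * p for k, p in pdf.items())

def variance(pdf):
    """Var(X) = sum of (k - mu)^2 * p_X(k), the discrete case of the definition."""
    mu = mean(pdf)
    return sum((k - mu) ** 2 * p for k, p in pdf.items())

half = Fraction(1, 2)
p_X1 = {-1: half, 1: half}
p_X2 = {-1_000_000: half, 1_000_000: half}

for name, pdf in (("X1", p_X1), ("X2", p_X2)):
    # The average signed deviation is always zero, so it cannot measure spread.
    mean_deviation = sum((k - mean(pdf)) * p for k, p in pdf.items())
    print(name, mean(pdf), mean_deviation, variance(pdf))
```

Both pdfs report a mean of 0 and an average signed deviation of 0, while the variances differ by a factor of $10^{12}$.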


Comment One unfortunate consequence of Definition 3.6.1 is that the units for 
the variance are the square of the units for the random variable: If Y is measured 
in inches, for example, the units for Var(Y) are inches squared. This causes obvi- 
ous problems in relating the variance back to the sample values. For that reason, 
in applied statistics, where unit compatibility is especially important, dispersion is 
measured not by the variance but by the standard deviation, which is defined to be 
the square root of the variance. That is, 


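$$\sigma = \text{standard deviation} = \begin{cases} \sqrt{\sum_{\text{all } k} (k - \mu)^2 \cdot p_X(k)} & \text{if } X \text{ is discrete} \\ \sqrt{\int_{-\infty}^{\infty} (y - \mu)^2 \cdot f_Y(y) \, dy} & \text{if } Y \text{ is continuous} \end{cases}$$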

Comment The analogy between the expected value of a random variable and the
center of gravity of a physical system was pointed out in Section 3.5. A similar equivalency
holds between the variance and what engineers call a moment of inertia. If
a set of weights having masses $m_1, m_2, \ldots$ are positioned along a (weightless) rigid
bar at distances $r_1, r_2, \ldots$ from an axis of rotation (see Figure 3.6.1), the moment of
inertia of the system is defined to be the value $\sum_i m_i r_i^2$. Notice, though, that if the masses
were the probabilities associated with a discrete random variable and if the axis of
rotation were actually $\mu$, then $r_1, r_2, \ldots$ could be written $(k_1 - \mu), (k_2 - \mu), \ldots$ and
$\sum_i m_i r_i^2$ would be the same as the variance, $\sum_{\text{all } k} (k - \mu)^2 \cdot p_X(k)$.


Figure 3.6.1


Definition 3.6.1 gives a formula for calculating $\sigma^2$ in both the discrete and
the continuous cases. An equivalent, but easier-to-use, formula is given in Theorem 3.6.1.


Theorem 3.6.1
Let W be any random variable, discrete or continuous, having mean $\mu$ and for which
$E(W^2)$ is finite. Then

$$\text{Var}(W) = \sigma^2 = E(W^2) - \mu^2$$


Proof We will prove the theorem for the continuous case. The argument for discrete
W is similar. In Theorem 3.5.3, let $g(W) = (W - \mu)^2$. Then

$$\text{Var}(W) = E[(W - \mu)^2] = \int_{-\infty}^{\infty} g(w) f_W(w) \, dw = \int_{-\infty}^{\infty} (w - \mu)^2 f_W(w) \, dw$$

Squaring out the term $(w - \mu)^2$ that appears in the integrand and using the additive
property of integrals gives

$$\int_{-\infty}^{\infty} (w - \mu)^2 f_W(w) \, dw = \int_{-\infty}^{\infty} (w^2 - 2\mu w + \mu^2) f_W(w) \, dw$$

$$= \int_{-\infty}^{\infty} w^2 f_W(w) \, dw - 2\mu \int_{-\infty}^{\infty} w f_W(w) \, dw + \mu^2 \int_{-\infty}^{\infty} f_W(w) \, dw$$

$$= E(W^2) - 2\mu^2 + \mu^2 = E(W^2) - \mu^2$$

Note that the equality $\int_{-\infty}^{\infty} w^2 f_W(w) \, dw = E(W^2)$ also follows from Theorem 3.5.3.


Example 3.6.1
An urn contains five chips, two red and three white. Suppose that two are drawn
out at random, without replacement. Let X denote the number of red chips in the
sample. Find Var(X).
Note, first, that since the chips are not being replaced from drawing to drawing,
X is a hypergeometric random variable. Moreover, we need to find $\mu$, regardless of
which formula is used to calculate $\sigma^2$. In the notation of Theorem 3.5.2, r = 2, w = 3,
and n = 2, so

$$\mu = rn/(r + w) = 2 \cdot 2/(2 + 3) = 0.8$$

To find Var(X) using Definition 3.6.1, we write

$$\text{Var}(X) = E[(X - \mu)^2] = \sum_{\text{all } x} (x - \mu)^2 \cdot f_X(x)$$

$$= (0 - 0.8)^2 \cdot \frac{\binom{2}{0}\binom{3}{2}}{\binom{5}{2}} + (1 - 0.8)^2 \cdot \frac{\binom{2}{1}\binom{3}{1}}{\binom{5}{2}} + (2 - 0.8)^2 \cdot \frac{\binom{2}{2}\binom{3}{0}}{\binom{5}{2}}$$

$$= 0.36$$

To use Theorem 3.6.1, we would first find $E(X^2)$. From Theorem 3.5.3,

$$E(X^2) = \sum_{\text{all } x} x^2 \cdot f_X(x) = 0^2 \cdot \frac{\binom{2}{0}\binom{3}{2}}{\binom{5}{2}} + 1^2 \cdot \frac{\binom{2}{1}\binom{3}{1}}{\binom{5}{2}} + 2^2 \cdot \frac{\binom{2}{2}\binom{3}{0}}{\binom{5}{2}}$$

$$= 1.00$$

Then

$$\text{Var}(X) = E(X^2) - \mu^2 = 1.00 - (0.8)^2 = 0.36$$

confirming what we calculated earlier.
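
Both variance calculations for the urn example can be reproduced with exact fractions. This is a small sketch under the example's values r = 2, w = 3, n = 2; the variable and function names are assumptions.

```python
from fractions import Fraction
from math import comb

r, w, n = 2, 3, 2   # red chips, white chips, sample size

def p_X(x):
    """Hypergeometric probability of drawing x red chips in the sample."""
    return Fraction(comb(r, x) * comb(w, n - x), comb(r + w, n))

support = range(0, min(r, n) + 1)

mu = sum(x * p_X(x) for x in support)
var_definition = sum((x - mu) ** 2 * p_X(x) for x in support)   # Definition 3.6.1
second_moment = sum(x ** 2 * p_X(x) for x in support)
var_shortcut = second_moment - mu ** 2                          # Theorem 3.6.1

print(mu, second_moment, var_definition, var_shortcut)
```

The exact values 4/5, 1, and 9/25 correspond to the decimals 0.8, 1.00, and 0.36 in the example.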


In Section 3.5 we encountered a change of scale formula that applied
to expected values. For any constants a and b and any random variable W,
E(aW + b) = aE(W) + b. A similar issue arises in connection with the variance of
a linear transformation: If $\text{Var}(W) = \sigma^2$, what is the variance of aW + b?


Theorem 3.6.2
Let W be any random variable having mean $\mu$ and where $E(W^2)$ is finite. Then
$\text{Var}(aW + b) = a^2 \text{Var}(W)$.


Proof Using the same approach taken in the proof of Theorem 3.6.1, it can be
shown that $E[(aW + b)^2] = a^2 E(W^2) + 2ab\mu + b^2$. We also know from the corollary
to Theorem 3.5.3 that $E(aW + b) = a\mu + b$. Using Theorem 3.6.1, then, we can
write

$$\text{Var}(aW + b) = E[(aW + b)^2] - [E(aW + b)]^2$$
$$= [a^2 E(W^2) + 2ab\mu + b^2] - [a\mu + b]^2$$
$$= [a^2 E(W^2) + 2ab\mu + b^2] - [a^2\mu^2 + 2ab\mu + b^2]$$
$$= a^2 [E(W^2) - \mu^2] = a^2 \text{Var}(W)$$


Example 3.6.2
A random variable Y is described by the pdf

$$f_Y(y) = 2y, \quad 0 \le y \le 1$$

What is the standard deviation of 3Y + 2?
First, we need to find the variance of Y. But

$$E(Y) = \int_0^1 y \cdot 2y \, dy = \frac{2}{3}$$

and

$$E(Y^2) = \int_0^1 y^2 \cdot 2y \, dy = \frac{1}{2}$$

so

$$\text{Var}(Y) = E(Y^2) - \mu^2 = \frac{1}{2} - \left(\frac{2}{3}\right)^2 = \frac{1}{18}$$

Then, by Theorem 3.6.2,

$$\text{Var}(3Y + 2) = (3)^2 \cdot \text{Var}(Y) = 9 \cdot \frac{1}{18} = \frac{1}{2}$$

which makes the standard deviation of 3Y + 2 equal to $\sqrt{1/2}$, or 0.71.
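
The scaling rule for the variance can be checked against a direct calculation for the pdf 2y on [0, 1]. The sketch below relies on the fact that, for this pdf, $E(Y^r) = \int_0^1 2y^{r+1} \, dy = 2/(r + 2)$; the function and variable names are illustrative assumptions.

```python
import math
from fractions import Fraction

def moment(r):
    """E(Y^r) for the pdf 2y on [0, 1]: the integral of 2 * y^(r+1) is 2 / (r + 2)."""
    return Fraction(2, r + 2)

a, b = 3, 2
mu = moment(1)
var_Y = moment(2) - mu ** 2                      # Theorem 3.6.1

# Route 1: the scaling rule Var(aY + b) = a^2 Var(Y).
var_scaled = a ** 2 * var_Y

# Route 2: expand E[(aY + b)^2] - [E(aY + b)]^2 directly.
var_direct = (a ** 2 * moment(2) + 2 * a * b * mu + b ** 2) - (a * mu + b) ** 2

sd = math.sqrt(var_scaled)
print(mu, var_Y, var_scaled, var_direct, round(sd, 2))
```

The additive constant b drops out of both routes, which is why shifting a random variable leaves its spread unchanged.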


Questions 


3.6.1. Find Var(X) for the urn problem of Example 3.6.1 
if the sampling is done with replacement. 


3.6.2. Find the variance of Y if

$$f_Y(y) = \begin{cases} \frac{3}{4}, & 0 \le y \le 1 \\ \frac{1}{4}, & 2 \le y \le 3 \\ 0, & \text{elsewhere} \end{cases}$$


3.6.3. Ten equally qualified applicants, six men and four 
women, apply for three lab technician positions. Unable to 
justify choosing any of the applicants over all the others, 
the personnel director decides to select the three at ran- 
dom. Let X denote the number of men hired. Compute 
the standard deviation of X. 


3.6.4. Compute the variance for a uniform random vari- 
able defined on the unit interval. 


3.6.5. Use Theorem 3.6.1 to find the variance of the
random variable Y, where

$$f_Y(y) = 3(1 - y)^2, \quad 0 < y < 1$$


3.6.6. If 
2 
fr) = = O<y<k 


for what value of k does Var(Y) =2? 


3.6.7. Calculate the standard deviation, o, for the random 
variable Y whose pdf has the graph shown below: 


[Figure: graph of $f_Y(y)$]


3.6.8. Consider the pdf defined by 
2 


Show that (a) [°° fr(y) dy =1, (b)E(Y) =2, and (c) Var(Y) 
is not finite. 


3.6.9. Frankie and Johnny play the following game.
Frankie selects a number at random from the interval
[a, b]. Johnny, not knowing Frankie’s number, is to pick
a second number from that same interval and pay Frankie
an amount, W, equal to the squared difference between
the two [so $0 \le W \le (b - a)^2$]. What should be Johnny’s
strategy if he wants to minimize his expected loss?


3.6.10. Let Y be a random variable whose pdf is given by
$f_Y(y) = 5y^4$, $0 \le y \le 1$. Use Theorem 3.6.1 to find Var(Y).


3.6.11. Suppose that Y is an exponential random variable,
so $f_Y(y) = \lambda e^{-\lambda y}$, $y \ge 0$. Show that the variance of Y is $1/\lambda^2$.


3.6.12. Suppose that Y is an exponential random variable
with $\lambda = 2$ (recall Question 3.6.11). Find
$P[Y > E(Y) + 2\sqrt{\text{Var}(Y)}]$.


3.6.13. Let X be a random variable with finite mean $\mu$.
Define for every real number a, $g(a) = E[(X - a)^2]$. Show
that

$$g(a) = E[(X - \mu)^2] + (\mu - a)^2$$

What is another name for $\min_a g(a)$?


3.6.14. Let Y have the pdf given in Question 3.6.5. Find
the variance of W, where W = -5Y + 12.


3.6.15. If Y denotes a temperature recorded in degrees
Fahrenheit, then $\frac{5}{9}(Y - 32)$ is the corresponding temperature
in degrees Celsius. If the standard deviation for a set
of temperatures is 15.7°F, what is the standard deviation
of the equivalent Celsius temperatures?


3.6.16. If $E(W) = \mu$ and $\text{Var}(W) = \sigma^2$, show that

$$E\left(\frac{W - \mu}{\sigma}\right) = 0 \quad \text{and} \quad \text{Var}\left(\frac{W - \mu}{\sigma}\right) = 1$$


3.6.17. Suppose U is a uniform random variable over 
[0, 1]. 


(a) Show that Y = (b—a)U +a is uniform over [a, b]. 
(b) Use part (a) and Question 3.6.4 to find the variance 
of Y. 


3.6.18. Recovering small quantities of calcium in the presence
of magnesium can be a difficult problem for an
analytical chemist. Suppose the amount of calcium Y to
be recovered is uniformly distributed between 4 and 7 mg.
The amount of calcium recovered by one method is the
random variable

$$W_1 = 0.2281 + (0.9948)Y + E_1$$

where the error term $E_1$ has mean 0 and variance 0.0427
and is independent of Y.
A second procedure has random variable

$$W_2 = -0.0748 + (1.0024)Y + E_2$$

where the error term $E_2$ has mean 0 and variance 0.0159
and is independent of Y.

The better technique should have a mean as close as
possible to the mean of Y (= 5.5), and a variance as small as
possible. Compare the two methods on the basis of mean
and variance.


Higher Moments


The quantities we have identified as the mean and the variance are actually special
cases of what are referred to more generally as the moments of a random
variable. More precisely, E(W) is the first moment about the origin and $\sigma^2$ is the
second moment about the mean. As the terminology suggests, we will have occasion
to define higher moments of W. Just as E(W) and $\sigma^2$ reflect a random variable’s
location and dispersion, so it is possible to characterize other aspects of a distribution
in terms of other moments. We will see, for example, that the skewness of a
distribution (that is, the extent to which it is not symmetric around $\mu$) can be effectively
measured in terms of a third moment. Likewise, there are issues that arise in
certain applied statistics problems that require a knowledge of the flatness of a pdf,
a property that can be quantified by the fourth moment.


Definition 3.6.2. Let W be any random variable with pdf $f_W(w)$. For any
positive integer r,

1. The rth moment of W about the origin, $\mu_r$, is given by

$$\mu_r = E(W^r)$$

provided $\int_{-\infty}^{\infty} |w|^r \cdot f_W(w) \, dw < \infty$ (or provided the analogous condition
on the summation of $|w|^r$ holds, if W is discrete). When r = 1, we usually
delete the subscript and write E(W) as $\mu$ rather than $\mu_1$.

2. The rth moment of W about the mean, $\mu_r'$, is given by

$$\mu_r' = E[(W - \mu)^r]$$

provided the finiteness conditions of part 1 hold.


Comment We can express $\mu_r'$ in terms of $\mu_j$, j = 1, 2, ..., r, by simply writing out
the binomial expansion of $(W - \mu)^r$:

$$\mu_r' = E[(W - \mu)^r] = \sum_{j=0}^{r} \binom{r}{j} E(W^j)(-\mu)^{r-j}$$

Thus,

$$\mu_2' = E[(W - \mu)^2] = \sigma^2 = \mu_2 - \mu_1^2$$

$$\mu_3' = E[(W - \mu)^3] = \mu_3 - 3\mu_1\mu_2 + 2\mu_1^3$$

$$\mu_4' = E[(W - \mu)^4] = \mu_4 - 4\mu_1\mu_3 + 6\mu_1^2\mu_2 - 3\mu_1^4$$

and so on.


Example 3.6.3

The skewness of a pdf can be measured in terms of its third moment about the mean.
If a pdf is symmetric (and its third moment exists), $E[(W - \mu)^3]$ will be zero; for
pdfs that are not symmetric, $E[(W - \mu)^3]$ will generally not be zero. In practice, the
symmetry (or lack of symmetry) of a pdf is often measured by the coefficient of
skewness, $\gamma_1$, where

$$\gamma_1 = \frac{E[(W - \mu)^3]}{\sigma^3}$$

Dividing $\mu_3'$ by $\sigma^3$ makes $\gamma_1$ dimensionless.

A second "shape" parameter in common use is the coefficient of kurtosis, $\gamma_2$,
which involves the fourth moment about the mean. Specifically,

$$\gamma_2 = \frac{E[(W - \mu)^4]}{\sigma^4} - 3$$

For certain pdfs, $\gamma_2$ is a useful measure of peakedness: relatively flat pdfs are said to
be platykurtic; more peaked pdfs are called leptokurtic.


Earlier in this chapter we encountered random variables whose means do not
exist; recall, for example, the St. Petersburg paradox. More generally, there are
random variables having certain of their higher moments finite and certain others
not finite. Addressing the question of whether or not a given $E(W^j)$ is finite is the
following existence theorem.

Theorem 3.6.3
If the kth moment of a random variable exists, all moments of order less than k exist.

Proof Let $f_Y(y)$ be the pdf of a continuous random variable Y. By Definition 3.6.2,
$E(Y^k)$ exists if and only if

$$\int_{-\infty}^{\infty} |y|^k \cdot f_Y(y)\,dy < \infty \qquad (3.6.2)$$

Let $1 \le j < k$. To prove the theorem we must show that

$$\int_{-\infty}^{\infty} |y|^j \cdot f_Y(y)\,dy < \infty$$

is implied by Inequality 3.6.2. But

$$\int_{-\infty}^{\infty} |y|^j \cdot f_Y(y)\,dy = \int_{|y| \le 1} |y|^j \cdot f_Y(y)\,dy + \int_{|y| > 1} |y|^j \cdot f_Y(y)\,dy$$

$$\le \int_{|y| \le 1} f_Y(y)\,dy + \int_{|y| > 1} |y|^j \cdot f_Y(y)\,dy$$

$$\le 1 + \int_{|y| > 1} |y|^j \cdot f_Y(y)\,dy$$

$$\le 1 + \int_{|y| > 1} |y|^k \cdot f_Y(y)\,dy < \infty$$

Therefore, $E(Y^j)$ exists, j = 1, 2, ..., k - 1. The proof for discrete random variables
is similar.
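
The binomial relation between moments about the mean and moments about the origin (the Comment following Definition 3.6.2) can be checked numerically. The sketch below is only an illustration, not part of the text: it assumes a fair six-sided die as the discrete random variable, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def raw_moment(pmf, r):
    """rth moment about the origin, E(W^r), for a discrete pmf {value: prob}."""
    return sum(Fraction(w) ** r * p for w, p in pmf.items())


def central_moment_via_binomial(pmf, r):
    """rth moment about the mean, via sum_j C(r, j) E(W^j) (-mu)^(r-j)."""
    mu = raw_moment(pmf, 1)
    return sum(comb(r, j) * raw_moment(pmf, j) * (-mu) ** (r - j)
               for j in range(r + 1))


def central_moment_direct(pmf, r):
    """rth moment about the mean, computed straight from E[(W - mu)^r]."""
    mu = raw_moment(pmf, 1)
    return sum((Fraction(w) - mu) ** r * p for w, p in pmf.items())


# Assumed example: a fair die (not taken from the text).
die = {face: Fraction(1, 6) for face in range(1, 7)}

variance = central_moment_via_binomial(die, 2)
third = central_moment_via_binomial(die, 3)
fourth = central_moment_via_binomial(die, 4)

# Coefficient of kurtosis: gamma_2 = mu_4' / sigma^4 - 3.
kurtosis = fourth / variance ** 2 - 3
print(variance, third, fourth, kurtosis)
```

Because the die's pdf is symmetric about its mean, the third moment about the mean comes out as zero, in line with the remark on skewness in Example 3.6.3.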


Questions


3.6.19. Let Y be a uniform random variable defined over the interval (0, 2).
Find an expression for the rth moment of Y about the origin. Also, use the
binomial expansion as described in the Comment to find $E[(Y - \mu)^6]$.


3.6.20. Find the coefficient of skewness for an exponential random variable
having the pdf

$$f_Y(y) = e^{-y}, \quad y > 0$$


3.6.21. Calculate the coefficient of kurtosis for a uniform random variable
defined over the unit interval, $f_Y(y) = 1$, for $0 \le y \le 1$.


3.6.22. Suppose that W is a random variable for which $E[(W - \mu)^3] = 10$ and
$E(W^3) = 4$. Is it possible that $\mu = 2$?


3.6.23. If Y = aX + b, a > 0, show that Y has the same coefficients of skewness
and kurtosis as X.


3.6.24. Let Y be the random variable of Question 3.4.6, where for a positive
integer n, $f_Y(y) = (n+2)(n+1)y^n(1-y)$, $0 \le y \le 1$.

(a) Find Var(Y).
(b) For any positive integer k, find the kth moment around the origin.


3.6.25. Suppose that the random variable Y is described by the pdf

$$f_Y(y) = c \cdot y^{-6}, \quad y > 1$$

(a) Find c.
(b) What is the highest moment of Y that exists?


3.7 Joint Densities


Sections 3.3 and 3.4 introduced the basic terminology for describing the
probabilistic behavior of a single random variable. Such information, while
adequate for many problems, is insufficient when more than one variable is of
interest to the experimenter. Medical researchers, for example, continue to
explore the relationship between blood cholesterol and heart disease, and, more
recently, between "good" cholesterol and "bad" cholesterol. And more than a
little attention, both political and pedagogical, is given to the role played by
K-12 funding in the performance of would-be high school graduates on exit exams.
On a smaller scale, electronic equipment and systems are often designed to have
built-in redundancy: Whether or not that equipment functions properly ultimately
depends on the reliability of two different components.

The point is, there are many situations where two relevant random variables,
say, X and Y,² are defined on the same sample space. Knowing only $f_X(x)$ and
$f_Y(y)$, though, does not necessarily provide enough information to characterize
the all-important simultaneous behavior of X and Y. The purpose of this section
is to introduce the concepts, definitions, and mathematical techniques associated
with distributions based on two (or more) random variables.


Discrete Joint Pdfs 


As we saw in the single-variable case, the pdf is defined differently depending on
whether the random variable is discrete or continuous. The same distinction applies
to joint pdfs. We begin with a discussion of joint pdfs as they apply to two discrete
random variables.

² For the next several sections we will suspend our earlier practice of using X to
denote a discrete random variable and Y to denote a continuous random variable. The
category of the random variables will need to be determined from the context of the
problem. Typically, though, X and Y will either be both discrete or both continuous.


Definition 3.7.1. Suppose S is a discrete sample space on which two random
variables, X and Y, are defined. The joint probability density function of X and
Y (or joint pdf) is denoted $p_{X,Y}(x, y)$, where

$$p_{X,Y}(x, y) = P(\{s \mid X(s) = x \text{ and } Y(s) = y\})$$


Comment A convenient shorthand notation for the meaning of $p_{X,Y}(x, y)$,
consistent with what we used earlier for pdfs of single discrete random variables,
is to write $p_{X,Y}(x, y) = P(X = x, Y = y)$.


Example 3.7.1

A supermarket has two express lines. Let X and Y denote the number of customers
in the first and in the second, respectively, at any given time. During nonrush
hours, the joint pdf of X and Y is summarized by the following table:

                      X
              0      1      2      3
        0    0.1    0.2    0      0
   Y    1    0.2    0.25   0.05   0
        2    0      0.05   0.05   0.025
        3    0      0      0.025  0.05

Find P(|X - Y| = 1), the probability that X and Y differ by exactly 1.
By definition,

$$P(|X - Y| = 1) = \sum\sum_{|x-y|=1} p_{X,Y}(x, y)$$

$$= p_{X,Y}(0, 1) + p_{X,Y}(1, 0) + p_{X,Y}(1, 2) + p_{X,Y}(2, 1) + p_{X,Y}(2, 3) + p_{X,Y}(3, 2)$$

$$= 0.2 + 0.2 + 0.05 + 0.05 + 0.025 + 0.025$$

$$= 0.55$$

[Would you expect $p_{X,Y}(x, y)$ to be symmetric? Would you expect the event
$|X - Y| \ge 2$ to have zero probability?]
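
The sum in Example 3.7.1 can be reproduced by storing the table as a dictionary and adding up the entries that satisfy the event. This is only an illustrative sketch; the dictionary layout and the helper name `event_probability` are assumptions, not part of the text.

```python
# Joint pdf of Example 3.7.1, keyed by (x, y); the table is symmetric,
# so the orientation of the key does not affect the results below.
joint_pdf = {
    (0, 0): 0.1, (1, 0): 0.2,  (2, 0): 0.0,   (3, 0): 0.0,
    (0, 1): 0.2, (1, 1): 0.25, (2, 1): 0.05,  (3, 1): 0.0,
    (0, 2): 0.0, (1, 2): 0.05, (2, 2): 0.05,  (3, 2): 0.025,
    (0, 3): 0.0, (1, 3): 0.0,  (2, 3): 0.025, (3, 3): 0.05,
}


def event_probability(pdf, event):
    """Sum p(x, y) over every point (x, y) for which event(x, y) is true."""
    return sum(p for (x, y), p in pdf.items() if event(x, y))


total = event_probability(joint_pdf, lambda x, y: True)
differ_by_one = event_probability(joint_pdf, lambda x, y: abs(x - y) == 1)
differ_by_two_or_more = event_probability(joint_pdf, lambda x, y: abs(x - y) >= 2)

print(round(total, 10), round(differ_by_one, 10), round(differ_by_two_or_more, 10))
```

Passing the event as a function keeps the same helper usable for any other region of the (x, y) grid.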
Example 3.7.2

Suppose two fair dice are rolled. Let X be the sum of the numbers showing, and let
Y be the larger of the two. So, for example,

$$p_{X,Y}(2, 3) = P(X = 2, Y = 3) = P(\emptyset) = 0$$

$$p_{X,Y}(4, 3) = P(X = 4, Y = 3) = P(\{(1, 3), (3, 1)\}) = \frac{2}{36}$$

and

$$p_{X,Y}(6, 3) = P(X = 6, Y = 3) = P(\{(3, 3)\}) = \frac{1}{36}$$

The entire joint pdf is given in Table 3.7.1.
Table 3.7.1

  x \ y      1      2      3      4      5      6     Row totals
    2      1/36    0      0      0      0      0        1/36
    3       0     2/36    0      0      0      0        2/36
    4       0     1/36   2/36    0      0      0        3/36
    5       0      0     2/36   2/36    0      0        4/36
    6       0      0     1/36   2/36   2/36    0        5/36
    7       0      0      0     2/36   2/36   2/36      6/36
    8       0      0      0     1/36   2/36   2/36      5/36
    9       0      0      0      0     2/36   2/36      4/36
   10       0      0      0      0     1/36   2/36      3/36
   11       0      0      0      0      0     2/36      2/36
   12       0      0      0      0      0     1/36      1/36

  Col. totals  1/36   3/36   5/36   7/36   9/36   11/36


Notice that the row totals in the right-hand margin of the table give the pdf
for X. Similarly, the column totals along the bottom detail the pdf for Y. Those
are not coincidences. Theorem 3.7.1 gives a formal statement of the relationship
between the joint pdf and the individual pdfs.


Theorem 3.7.1

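Suppose that $p_{X,Y}(x, y)$ is the joint pdf of the discrete random variables X and Y.
Then

$$p_X(x) = \sum_{\text{all } y} p_{X,Y}(x, y) \quad\text{and}\quad p_Y(y) = \sum_{\text{all } x} p_{X,Y}(x, y)$$

Proof We will prove the first statement. Note that the collection of sets (Y = y) for
all y forms a partition of S; that is, they are disjoint and $\bigcup_{\text{all } y} (Y = y) = S$. The set
$(X = x) = (X = x) \cap S = (X = x) \cap \bigcup_{\text{all } y} (Y = y) = \bigcup_{\text{all } y} [(X = x) \cap (Y = y)]$, so

$$p_X(x) = P(X = x) = P\left(\bigcup_{\text{all } y} [(X = x) \cap (Y = y)]\right)$$

$$= \sum_{\text{all } y} P(X = x, Y = y) = \sum_{\text{all } y} p_{X,Y}(x, y)$$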


Definition 3.7.2. An individual pdf obtained by summing a joint pdf over all 
values of the other random variable is called a marginal pdf. 
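
As a rough check on Theorem 3.7.1, the joint pdf of Example 3.7.2 can be built by enumerating the 36 equally likely outcomes of two fair dice and then summed over each variable in turn. The helper name `marginal` is an assumption made for this sketch.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint pdf of X = sum of the two dice and Y = larger of the two dice.
joint = defaultdict(Fraction)
for first, second in product(range(1, 7), repeat=2):
    joint[(first + second, max(first, second))] += Fraction(1, 36)


def marginal(joint_pdf, axis):
    """Sum the joint pdf over the other variable (axis 0 keeps x, axis 1 keeps y)."""
    totals = defaultdict(Fraction)
    for point, prob in joint_pdf.items():
        totals[point[axis]] += prob
    return dict(totals)


p_x = marginal(joint, 0)  # row totals of Table 3.7.1
p_y = marginal(joint, 1)  # column totals of Table 3.7.1

for y in sorted(p_y):
    print(y, p_y[y])
```

Exact fractions are used so the marginal totals can be compared with the table entries without rounding (note that `Fraction` prints values in lowest terms, so 3/36 appears as 1/12).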


Continuous Joint Pdfs 


If X and Y are both continuous random variables, Definition 3.7.1 does not apply 
because P(X =x, Y = y) will be identically 0 for all (x, y). As was the case in single- 
variable situations, the joint pdf for two continuous random variables will be defined 
as a function that when integrated yields the probability that (X, Y) lies in a specified 
region of the xy-plane. 


Definition 3.7.3. Two random variables defined on the same set of real numbers
are jointly continuous if there exists a function $f_{X,Y}(x, y)$ such that for any
region R in the xy-plane, $P[(X, Y) \in R] = \iint_R f_{X,Y}(x, y)\,dx\,dy$. The function
$f_{X,Y}(x, y)$ is the joint pdf of X and Y.


Comment Any function $f_{X,Y}(x, y)$ for which

1. $f_{X,Y}(x, y) \ge 0$ for all x and y

2. $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$

qualifies as a joint pdf. We shall employ the convention of naming the domain only
where the joint pdf is nonzero; everywhere else it will be assumed to be zero. This is
analogous, of course, to the notation used earlier in describing the domain of single
random variables.


Example 3.7.3

Suppose that the variation in two continuous random variables, X and Y, can be
modeled by the joint pdf $f_{X,Y}(x, y) = cxy$, for 0 < y < x < 1. Find c.

By inspection, $f_{X,Y}(x, y)$ will be nonnegative as long as c > 0. The particular
c that qualifies $f_{X,Y}(x, y)$ as a joint pdf, though, is the one that makes the volume
under $f_{X,Y}(x, y)$ equal to 1. But

$$\int_0^1 \int_0^x cxy\,dy\,dx = 1 = c \int_0^1 x \left[\int_0^x y\,dy\right] dx = c \int_0^1 x \left(\frac{y^2}{2}\Big|_0^x\right) dx$$

$$= c \int_0^1 \left(\frac{x^3}{2}\right) dx = c\,\frac{x^4}{8}\Big|_0^1 = \left(\frac{1}{8}\right) c$$

Therefore, c = 8.


Example 3.7.4

A study claims that the daily number of hours, X, a teenager watches television and
the daily number of hours, Y, he works on his homework are approximated by the
joint pdf

$$f_{X,Y}(x, y) = xye^{-(x+y)}, \quad x > 0,\ y > 0$$

What is the probability that a teenager chosen at random spends at least twice as
much time watching television as he does working on his homework?

The region, R, in the xy-plane corresponding to the event "$X \ge 2Y$" is shown
in Figure 3.7.1. It follows that $P(X \ge 2Y)$ is the volume under $f_{X,Y}(x, y)$ above the
region R:

$$P(X \ge 2Y) = \int_0^{\infty} \int_0^{x/2} xye^{-(x+y)}\,dy\,dx$$

Separating variables, we can write

$$P(X \ge 2Y) = \int_0^{\infty} xe^{-x} \left[\int_0^{x/2} ye^{-y}\,dy\right] dx$$

and the double integral reduces to 7/27:

$$P(X \ge 2Y) = \int_0^{\infty} xe^{-x} \left[1 - \left(\frac{x}{2} + 1\right) e^{-x/2}\right] dx = \frac{7}{27}$$

Figure 3.7.1


Geometric Probability


One particularly important special case of Definition 3.7.3 is the joint uniform pdf,
which is represented by a surface having a constant height everywhere above a
specified rectangle in the xy-plane. That is,

$$f_{X,Y}(x, y) = \frac{1}{(b-a)(d-c)}, \quad a \le x \le b,\ c \le y \le d$$

If R is some region in the rectangle where X and Y are defined, $P((X, Y) \in R)$
reduces to a simple ratio of areas:

$$P((X, Y) \in R) = \frac{\text{area of } R}{(b-a)(d-c)} \qquad (3.7.1)$$

Calculations based on Equation 3.7.1 are referred to as geometric probabilities.


Example 3.7.5

Two friends agree to meet on the University Commons "sometime around 12:30."
But neither of them is particularly punctual, or patient. What will actually happen
is that each will arrive at random sometime in the interval from 12:00 to 1:00. If one
arrives and the other is not there, the first person will wait fifteen minutes or until
1:00, whichever comes first, and then leave. What is the probability that the two will
get together?

To simplify notation, we can represent the time period from 12:00 to 1:00 as the
interval from zero to sixty minutes. Then if x and y denote the two arrival times, the
sample space is the 60 x 60 square shown in Figure 3.7.2. Furthermore, the event
M, "The two friends meet," will occur if and only if $|x - y| \le 15$ or, equivalently,
if and only if $-15 \le x - y \le 15$. These inequalities appear as the shaded region in
Figure 3.7.2.

Figure 3.7.2

Notice that the areas of the triangles above and below M are each equal to
$\frac{1}{2}(45)(45)$. It follows that the two friends have a 44% chance of meeting:

$$P(M) = \frac{\text{area of } M}{\text{area of } S} = \frac{(60)^2 - 2\left[\frac{1}{2}(45)(45)\right]}{(60)^2} = 0.44$$


Example 3.7.6

A carnival operator wants to set up a ringtoss game. Players will throw a ring 
of diameter d onto a grid of squares, the side of each square being of length s 
(see Figure 3.7.3). If the ring lands entirely inside a square, the player wins a 
prize. To ensure a profit, the operator must keep the player’s chances of winning 
down to something less than one in five. How small can the operator make the 
ratio d/s? 


Figure 3.7.4

First, assume that the player is required to stand far enough away that no skill is
involved and the ring is falling at random on the grid. From Figure 3.7.4, we see that
in order for the ring not to touch any side of the square, the ring's center must be
somewhere in the interior of a smaller square, each side of which is a distance d/2
from one of the grid lines.

Since the area of a grid square is $s^2$ and the area of an interior square is $(s - d)^2$,
the probability of a winning toss can be written as the ratio:

$$P(\text{Ring touches no lines}) = \frac{(s - d)^2}{s^2}$$

But the operator requires that

$$\frac{(s - d)^2}{s^2} \le 0.20$$

Solving for d/s gives

$$\frac{d}{s} \ge 1 - \sqrt{0.20} = 0.55$$

That is, if the diameter of the ring is at least 55% as long as the side of one of the
squares, the player will have no more than a 20% chance of winning.
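
Geometric probabilities such as those in Examples 3.7.5 and 3.7.6 lend themselves to a quick simulation check. The following is a minimal sketch, assuming a seeded pseudorandom generator and hypothetical function names; it compares a Monte Carlo estimate for the meeting problem with the ratio-of-areas answer and evaluates the ringtoss formula at the threshold value of d/s.

```python
import random


def meeting_probability_exact(wait=15.0, window=60.0):
    """Area of the band |x - y| <= wait divided by the area of the square."""
    corner_triangle = 0.5 * (window - wait) ** 2
    return (window ** 2 - 2 * corner_triangle) / window ** 2


def meeting_probability_simulated(trials=200_000, wait=15.0, window=60.0, seed=0):
    """Estimate P(|X - Y| <= wait) for independent uniform arrival times."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0.0, window)
        y = rng.uniform(0.0, window)
        if abs(x - y) <= wait:
            hits += 1
    return hits / trials


def ringtoss_win_probability(d_over_s):
    """(s - d)^2 / s^2 written in terms of the ratio d/s, for 0 <= d/s <= 1."""
    return (1.0 - d_over_s) ** 2


exact = meeting_probability_exact()
estimate = meeting_probability_simulated()
threshold = 1 - 0.20 ** 0.5

print(exact, estimate, ringtoss_win_probability(threshold))
```

The exact value is 0.4375, which the text rounds to 0.44; the simulated value should agree with it to about two decimal places for this number of trials.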


Questions 


3.7.1. If $p_{X,Y}(x, y) = cxy$ at the points (1, 1), (2, 1), (2, 2),
and (3, 1), and equals 0 elsewhere, find c.


3.7.2. Let X and Y be two continuous random variables defined over the unit
square. What does c equal if $f_{X,Y}(x, y) = c(x^2 + y^2)$?


3.7.3. Suppose that random variables X and Y vary in accordance with the joint
pdf, $f_{X,Y}(x, y) = c(x + y)$, 0 < x < y < 1. Find c.


3.7.4. Find c if $f_{X,Y}(x, y) = cxy$ for X and Y defined over the triangle whose
vertices are the points (0, 0), (0, 1), and (1, 1).


3.7.5. An urn contains four red chips, three white chips, 
and two blue chips. A random sample of size 3 is drawn 
without replacement. Let X denote the number of white 
chips in the sample and Y the number of blue chips. Write 
a formula for the joint pdf of X and Y. 


3.7.6. Four cards are drawn from a standard poker deck. 
Let X be the number of kings drawn and Y the number of 
queens. Find py y(x, y). 


3.7.7. An advisor looks over the schedules of his fifty stu- 
dents to see how many math and science courses each has 
registered for in the coming semester. He summarizes his 
results in a table. What is the probability that a student 
selected at random will have signed up for more math 
courses than science courses? 


Number of math courses, X 
0 1 2 


Number 0 11 6 4 

of science 

courses, Y 1 9 10 
2 


Nn 
=) 
NW 


3.7.8. Consider the experiment of tossing a fair coin three 
times. Let X denote the number of heads on the last flip, 
and let Y denote the total number of heads on the three 
flips. Find py y(x, y). 


3.7.9. Suppose that two fair dice are tossed one time. Let 
X denote the number of 2’s that appear, and Y the number 
of 3’s. Write the matrix giving the joint probability density 
function for X and Y. Suppose a third random variable, Z, 
is defined, where Z = X + Y. Use py y(x, y) to find pz(z). 


3.7.10. Suppose that X and Y have a bivariate uniform density over the unit square:

$$f_{X,Y}(x, y) = \begin{cases} c, & 0 < x < 1,\ 0 < y < 1 \\ 0, & \text{elsewhere} \end{cases}$$

(a) Find c. 
(b) Find P(0< X <},0<Y<}). 


3.7.11. Let X and Y have the joint pdf

$$f_{X,Y}(x, y) = 2e^{-(x+y)}, \quad 0 < x < y,\ 0 < y$$

Find P(Y < 3X).


3.7.12. A point is chosen at random from the interior of a circle whose equation
is $x^2 + y^2 \le 4$. Let the random variables X and Y denote the x- and y-coordinates
of the sampled point. Find $f_{X,Y}(x, y)$.


3.7.13. Find P(X < 2Y) if $f_{X,Y}(x, y) = x + y$ for X and Y each defined over the
unit interval.
3.7.14. Suppose that five independent observations are
drawn from the continuous pdf $f_T(t) = 2t$, $0 \le t \le 1$. Let X
denote the number of t’s that fall in the interval 0 <t < ; 
and let Y denote the number of t’s that fall in the interval 
+ <t <4. Find py y(1, 2). 


3.7.15. A point is chosen at random from the interior 
of a right triangle with base b and height h. What is the 
probability that the y value is between 0 and h/2? 


Marginal Pdfs for Continuous Random Variables


The notion of marginal pdfs in connection with discrete random variables was
introduced in Theorem 3.7.1 and Definition 3.7.2. An analogous relationship holds in
the continuous case, with integration replacing the summation that appears in
Theorem 3.7.1.


Theorem 3.7.2

Suppose X and Y are jointly continuous with joint pdf $f_{X,Y}(x, y)$. Then the marginal
pdfs, $f_X(x)$ and $f_Y(y)$, are given by

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy \quad\text{and}\quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx$$

Proof It suffices to verify the first of the theorem's two equalities. As is often the
case with proofs for continuous random variables, we begin with the cdf:

$$F_X(x) = P(X \le x) = \int_{-\infty}^{\infty} \int_{-\infty}^{x} f_{X,Y}(t, y)\,dt\,dy = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f_{X,Y}(t, y)\,dy\,dt$$

Differentiating both ends of the equation above gives

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$$

(recall Theorem 3.4.1).
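
Theorem 3.7.2 can be illustrated numerically by integrating a joint pdf over y with a simple midpoint rule. The sketch below reuses the joint pdf of Example 3.7.4, $f_{X,Y}(x, y) = xye^{-(x+y)}$, whose marginal works out to $f_X(x) = xe^{-x}$ because $\int_0^{\infty} ye^{-y}\,dy = 1$. The truncation point and step count are assumptions chosen for this illustration only.

```python
import math


def joint_pdf(x, y):
    """Joint pdf of Example 3.7.4; zero outside the first quadrant."""
    if x < 0 or y < 0:
        return 0.0
    return x * y * math.exp(-(x + y))


def marginal_x(x, upper=50.0, steps=20_000):
    """Approximate the integral of joint_pdf(x, y) over y with a midpoint rule.

    The infinite upper limit is truncated at `upper`, where the integrand
    is already negligible for this particular pdf.
    """
    width = upper / steps
    return width * sum(joint_pdf(x, (i + 0.5) * width) for i in range(steps))


for x in (0.5, 1.0, 2.0):
    print(x, marginal_x(x), x * math.exp(-x))
```

The two printed columns should agree to several decimal places; a coarser grid or a smaller truncation point would loosen that agreement.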


Example 3.7.7

Suppose that two continuous random variables, X and Y, have the joint uniform pdf

$$f_{X,Y}(x, y) = \frac{1}{6}, \quad 0 \le x \le 3,\ 0 \le y \le 2$$

Find $f_X(x)$.

Applying Theorem 3.7.2 gives

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \int_0^2 \frac{1}{6}\,dy = \frac{1}{3}, \quad 0 \le x \le 3$$

Notice that X, by itself, is a uniform random variable defined over the interval [0, 3];
similarly, we would find that $f_Y(y)$ has a uniform pdf over the interval [0, 2].


Example 3.7.8

Consider the case where X and Y are two continuous random variables, jointly
distributed over the first quadrant of the xy-plane according to the joint pdf,

$$f_{X,Y}(x, y) = \begin{cases} y^2 e^{-y(x+1)}, & x \ge 0,\ y \ge 0 \\ 0, & \text{elsewhere} \end{cases}$$

Find the two marginal pdfs.

First, consider $f_X(x)$. By Theorem 3.7.2,

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \int_0^{\infty} y^2 e^{-y(x+1)}\,dy$$

In the integrand, substitute

$$u = y(x + 1)$$

making du = (x + 1) dy. This gives

$$f_X(x) = \frac{1}{x+1} \int_0^{\infty} \frac{u^2}{(x+1)^2}\, e^{-u}\,du = \frac{1}{(x+1)^3} \int_0^{\infty} u^2 e^{-u}\,du$$

After applying integration by parts (twice) to $\int_0^{\infty} u^2 e^{-u}\,du$, we get

$$f_X(x) = \frac{1}{(x+1)^3} \left[-u^2 e^{-u} - 2ue^{-u} - 2e^{-u}\right]_0^{\infty}$$

$$= \frac{1}{(x+1)^3} \left[2 - \lim_{u \to \infty} \left(\frac{u^2}{e^u} + \frac{2u}{e^u} + \frac{2}{e^u}\right)\right]$$

$$= \frac{2}{(x+1)^3}, \quad x \ge 0$$

Finding $f_Y(y)$ is a bit easier:

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = \int_0^{\infty} y^2 e^{-y(x+1)}\,dx$$

$$= y^2 e^{-y} \int_0^{\infty} e^{-yx}\,dx = y^2 e^{-y} \left(\frac{1}{y}\right) \left(-e^{-yx}\Big|_0^{\infty}\right)$$

$$= ye^{-y}, \quad y \ge 0$$


Questions


3.7.16. Find the marginal pdf of X for the joint pdf derived in Question 3.7.5.


3.7.17. Find the marginal pdfs of X and Y for the joint pdf derived in
Question 3.7.8.


3.7.18. The campus recruiter for an international conglomerate classifies the
large number of students she interviews into three categories: the lower quarter,
the middle half, and the upper quarter. If she meets six students on a given
morning, what is the probability that they will be evenly divided among the three
categories? What is the marginal probability that exactly two will belong to the
middle half?


3.7.19. For each of the following joint pdfs, find $f_X(x)$ and $f_Y(y)$.

(a) $f_{X,Y}(x, y) = \frac{1}{2}$, $0 \le x \le 2$, $0 \le y \le 1$
(b) $f_{X,Y}(x, y) = \frac{3}{2}y^2$, $0 \le x \le 2$, $0 \le y \le 1$
(c) $f_{X,Y}(x, y) = \frac{2}{3}(x + 2y)$, $0 \le x \le 1$, $0 \le y \le 1$
(d) $f_{X,Y}(x, y) = c(x + y)$, $0 \le x \le 1$, $0 \le y \le 1$
(e) $f_{X,Y}(x, y) = 4xy$, $0 \le x \le 1$, $0 \le y \le 1$
(f) $f_{X,Y}(x, y) = xye^{-(x+y)}$, $0 \le x$, $0 \le y$
(g) $f_{X,Y}(x, y) = ye^{-xy-y}$, $0 \le x$, $0 \le y$


3.7.20. For each of the following joint pdfs, find $f_X(x)$ and $f_Y(y)$.

(a) $f_{X,Y}(x, y) = \frac{1}{2}$, $0 \le x \le y \le 2$
(b) $f_{X,Y}(x, y) = \frac{1}{x}$, $0 \le y \le x \le 1$
(c) $f_{X,Y}(x, y) = 6x$, $0 \le x \le 1$, $0 \le y \le 1 - x$


3.7.21. Suppose that $f_{X,Y}(x, y) = 6(1 - x - y)$ for x and y defined over the unit
square, subject to the restriction that $0 \le x + y \le 1$. Find the marginal pdf for X.


3.7.22. Find $f_Y(y)$ if $f_{X,Y}(x, y) = 2e^{-x}e^{-y}$ for x and y defined over the shaded
region pictured.


3.7.23. Suppose that X and Y are discrete random variables with

$$p_{X,Y}(x, y) = \frac{4!}{x!\,y!\,(4 - x - y)!} \left(\frac{1}{2}\right)^x \left(\frac{1}{3}\right)^y \left(\frac{1}{6}\right)^{4-x-y}, \quad 0 \le x + y \le 4$$

Find $p_X(x)$ and $p_Y(y)$.


3.7.24. A generalization of the binomial model occurs when there is a sequence of
n independent trials with three outcomes, where $p_1$ = P(outcome 1) and
$p_2$ = P(outcome 2). Let X and Y denote the number of trials (out of n) resulting
in outcome 1 and outcome 2, respectively.

(a) Show that

$$p_{X,Y}(x, y) = \frac{n!}{x!\,y!\,(n - x - y)!}\, p_1^x\, p_2^y\, (1 - p_1 - p_2)^{n-x-y}, \quad 0 \le x + y \le n$$

(b) Find $p_X(x)$ and $p_Y(y)$.

(Hint: See Question 3.7.23.)


Joint Cdfs


For a single random variable X, the cdf of X evaluated at some point x (that is,
$F_X(x)$) is the probability that the random variable X takes on a value less than or
equal to x. Extended to two variables, a joint cdf [evaluated at the point (u, v)] is
the probability that $X \le u$ and, simultaneously, that $Y \le v$.


Definition 3.7.4. Let X and Y be any two random variables. The joint cumulative
distribution function of X and Y (or joint cdf) is denoted $F_{X,Y}(u, v)$, where

$$F_{X,Y}(u, v) = P(X \le u \text{ and } Y \le v)$$


Example 3.7.9

Find the joint cdf, $F_{X,Y}(u, v)$, for the two random variables X and Y whose joint pdf
is given by $f_{X,Y}(x, y) = \frac{4}{3}(x + xy)$, $0 \le x \le 1$, $0 \le y \le 1$.

If Definition 3.7.4 is applied, the probability that $X \le u$ and $Y \le v$ becomes a
double integral of $f_{X,Y}(x, y)$:

$$F_{X,Y}(u, v) = \frac{4}{3} \int_0^v \int_0^u (x + xy)\,dx\,dy = \frac{4}{3} \int_0^v \left[\int_0^u (x + xy)\,dx\right] dy$$

$$= \frac{4}{3} \int_0^v \left(\frac{x^2}{2}(1 + y)\Big|_0^u\right) dy = \frac{4}{3} \int_0^v \frac{u^2}{2}(1 + y)\,dy$$

$$= \frac{4}{3} \cdot \frac{u^2}{2} \left(y + \frac{y^2}{2}\right)\Big|_0^v = \frac{4}{3} \cdot \frac{u^2}{2} \left(v + \frac{v^2}{2}\right)$$

which simplifies to

$$F_{X,Y}(u, v) = \frac{1}{3} u^2 (2v + v^2)$$

[For what values of u and v is $F_{X,Y}(u, v)$ defined?]


Theorem 3.7.3

Let $F_{X,Y}(u, v)$ be the joint cdf associated with the continuous random variables X and
Y. Then the joint pdf of X and Y, $f_{X,Y}(x, y)$, is a second partial derivative of the joint
cdf; that is, $f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\,\partial y} F_{X,Y}(x, y)$, provided $F_{X,Y}(x, y)$ has continuous second
partial derivatives.


Example 3.7.10

What is the joint pdf of the random variables X and Y whose joint cdf is
$F_{X,Y}(x, y) = \frac{1}{3} x^2 (2y + y^2)$?

By Theorem 3.7.3,

$$f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\,\partial y} F_{X,Y}(x, y) = \frac{\partial^2}{\partial x\,\partial y}\, \frac{1}{3} x^2 (2y + y^2)$$

$$= \frac{\partial}{\partial y}\, \frac{2}{3} x (2y + y^2) = \frac{2}{3} x (2 + 2y) = \frac{4}{3}(x + xy)$$

Notice the similarity between Examples 3.7.9 and 3.7.10: $f_{X,Y}(x, y)$ is the same in
both examples; so is $F_{X,Y}(x, y)$.
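
The relationship in Theorem 3.7.3 can also be checked numerically by approximating the mixed partial derivative of the joint cdf with a central finite difference. The sketch below uses the cdf and pdf of Examples 3.7.9 and 3.7.10; the step size `h` and the function names are assumptions made for illustration.

```python
def joint_cdf(u, v):
    """Joint cdf from Example 3.7.9, valid for 0 <= u <= 1 and 0 <= v <= 1."""
    return u ** 2 * (2 * v + v ** 2) / 3.0


def joint_pdf(x, y):
    """Joint pdf from Examples 3.7.9 and 3.7.10 on the unit square."""
    return 4.0 * (x + x * y) / 3.0


def mixed_partial(func, x, y, h=1e-3):
    """Central-difference approximation to the second partial d^2 func / dx dy."""
    return (func(x + h, y + h) - func(x + h, y - h)
            - func(x - h, y + h) + func(x - h, y - h)) / (4 * h * h)


for point in [(0.5, 0.25), (0.2, 0.8), (0.9, 0.1)]:
    approx = mixed_partial(joint_cdf, *point)
    print(point, approx, joint_pdf(*point))
```

The test points are kept strictly inside the unit square so that the difference stencil never leaves the region where the cdf formula applies.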


Multivariate Densities 


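The definitions and theorems in this section extend in a very straightforward way
to situations involving more than two variables. The joint pdf for n discrete random
variables, for example, is denoted $p_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ where

$$p_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = P(X_1 = x_1, \ldots, X_n = x_n)$$

For n continuous random variables, the joint pdf is that function $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$
having the property that for any region R in n-space,

$$P[(X_1, \ldots, X_n) \in R] = \int \cdots \int_R f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n$$

And if $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ is the joint cdf of continuous random variables
$X_1, \ldots, X_n$ (that is, $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = P(X_1 \le x_1, \ldots, X_n \le x_n)$), then

$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$$

The notion of a marginal pdf also extends readily, although in the n-variate case, a
marginal pdf can, itself, be a joint pdf. Given $X_1, \ldots, X_n$, the marginal pdf of any
subset of r of those variables $(X_{i_1}, X_{i_2}, \ldots, X_{i_r})$ is derived by integrating (or summing)
the joint pdf with respect to the remaining n - r variables $(X_{j_1}, X_{j_2}, \ldots, X_{j_{n-r}})$. If the
$X_i$'s are all continuous, for example,

$$f_{X_{i_1},\ldots,X_{i_r}}(x_{i_1}, \ldots, x_{i_r}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_{j_1} \cdots dx_{j_{n-r}}$$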


Questions 


3.7.25. Consider the experiment of simultaneously toss- 
ing a fair coin and rolling a fair die. Let X denote the 
number of heads showing on the coin and Y the number 
of spots showing on the die. 


(a) List the outcomes in S. 
(b) Find Fy y(1, 2). 


3.7.26. An urn contains twelve chips: four red, three black, and five white. A
sample of size 4 is to be drawn without replacement. Let X denote the number of
white chips in the sample, Y the number of red. Find $F_{X,Y}(1, 2)$.


3.7.27. For each of the following joint pdfs, find $F_{X,Y}(u, v)$.

(a) $f_{X,Y}(x, y) = \frac{3}{2}y^2$, $0 \le x \le 2$, $0 \le y \le 1$
(b) $f_{X,Y}(x, y) = \frac{2}{3}(x + 2y)$, $0 \le x \le 1$, $0 \le y \le 1$
(c) $f_{X,Y}(x, y) = 4xy$, $0 \le x \le 1$, $0 \le y \le 1$

3.7.28. For each of the following joint pdfs, find $F_{X,Y}(u, v)$.

(a) $f_{X,Y}(x, y) = \frac{1}{2}$, $0 \le x \le y \le 2$
(b) $f_{X,Y}(x, y) = \frac{1}{x}$, $0 \le y \le x \le 1$
(c) $f_{X,Y}(x, y) = 6x$, $0 \le x \le 1$, $0 \le y \le 1 - x$

3.7.29. Find and graph fyy(x,y) if the joint cdf for 
random variables X and Y is 


Fy y(x, y)=xy, O<x<l, O<y<l 


3.7.30. Find the joint pdf associated with two random 
variables X and Y whose joint cdf is 


Fyy(x,y)=(l-e* )(1-e™), 


3.7.31. Given that Fy y(x, y) = k(4x?y? + 5xy*),0 <x < 
1,0 < y <1, find the corresponding pdf and use it to 
calculate P(0< X <$,5<Y <1). 


3.7.32. Prove that

$$P(a < X \le b,\ c < Y \le d) = F_{X,Y}(b, d) - F_{X,Y}(a, d) - F_{X,Y}(b, c) + F_{X,Y}(a, c)$$


x>0, y>0 
3.7.33. A certain brand of fluorescent bulbs will last, on the average, 1000 hours.
Suppose that four of these bulbs are installed in an office. What is the probability
that all four are still functioning after 1050 hours? If $X_i$ denotes the ith bulb's
life, assume that

$$f_{X_1,X_2,X_3,X_4}(x_1, x_2, x_3, x_4) = \prod_{i=1}^{4} \left(\frac{1}{1000}\right) e^{-x_i/1000}$$

for $x_i > 0$, i = 1, 2, 3, 4.


3.7.34. A hand of six cards is dealt from a standard poker 
deck. Let X denote the number of aces, Y the number of 
kings, and Z the number of queens. 


(a) Write a formula for py y.z(x, y, z). 
(b) Find px.y(x, y) and px,z(x, z). 


3.7.35. Calculate py y(0,1) if pyyz@,y,z) = 


aCe =a G) (a) ee . (as for x, y,z=0, 1, 2,3 
and0<x+y4+z<3. 


3.7.36. Suppose that the random variables X, Y, and Z have the multivariate pdf

$$f_{X,Y,Z}(x, y, z) = (x + y)e^{-z}$$

for 0 < x < 1, 0 < y < 1, and z > 0. Find (a) $f_{X,Y}(x, y)$, (b) $f_{Y,Z}(y, z)$, and
(c) $f_Z(z)$.


3.7.37. The four random variables W, X, Y, and Z have the multivariate pdf

$$f_{W,X,Y,Z}(w, x, y, z) = 16wxyz$$


for0<w<1,0<x <1,0<y<1,and0<z <1. Find the 
marginal pdf, fiy.x(w, x), and use it to compute P(0< W < 
554<X <1). 


3? 


Independence of Two Random Variables 


The concept of independent events that was introduced in Section 2.5 leads quite 
naturally to a similar definition for independent random variables. 


Definition 3.7.5. Two random variables X and Y are said to be independent
if for every interval A and every interval B,
$P(X \in A \text{ and } Y \in B) = P(X \in A)P(Y \in B)$.


Theorem 3.7.4

The continuous random variables X and Y are independent if and only if there are
functions g(x) and h(y) such that

$$f_{X,Y}(x, y) = g(x)h(y) \qquad (3.7.2)$$

If Equation 3.7.2 holds, there is a constant k such that $f_X(x) = kg(x)$ and
$f_Y(y) = (1/k)h(y)$.

Proof First, suppose that X and Y are independent. Then
$F_{X,Y}(x, y) = P(X \le x \text{ and } Y \le y) = P(X \le x)P(Y \le y) = F_X(x)F_Y(y)$, and we can write

$$f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\,\partial y} F_{X,Y}(x, y) = \frac{\partial^2}{\partial x\,\partial y} F_X(x)F_Y(y) = \frac{d}{dx}F_X(x)\,\frac{d}{dy}F_Y(y) = f_X(x)f_Y(y)$$

Next we need to show that Equation 3.7.2 implies that X and Y are independent.
To begin, note that

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \int_{-\infty}^{\infty} g(x)h(y)\,dy = g(x) \int_{-\infty}^{\infty} h(y)\,dy$$

Set $k = \int_{-\infty}^{\infty} h(y)\,dy$, so $f_X(x) = kg(x)$. Similarly, it can be shown that
$f_Y(y) = (1/k)h(y)$. Therefore,

$$P(X \in A \text{ and } Y \in B) = \int_A \int_B f_{X,Y}(x, y)\,dy\,dx = \int_A \int_B g(x)h(y)\,dy\,dx$$

$$= \int_A \int_B kg(x)(1/k)h(y)\,dy\,dx = \int_A f_X(x)\,dx \int_B f_Y(y)\,dy$$

$$= P(X \in A)P(Y \in B)$$

and the theorem is proved.


Comment Theorem 3.7.4 can be adapted to the case that X and Y are discrete.


Example 3.7.11
Suppose that the probabilistic behavior of two random variables X and Y is
described by the joint pdf $f_{X,Y}(x, y) = 12xy(1 - y)$, $0 \le x \le 1$, $0 \le y \le 1$. Are X and Y
independent? If they are, find $f_X(x)$ and $f_Y(y)$.

According to Theorem 3.7.4, the answer to the independence question is "yes"
if $f_{X,Y}(x, y)$ can be factored into a function of x times a function of y. There are such
functions. Let g(x) = 12x and h(y) = y(1 - y).

To find $f_X(x)$ and $f_Y(y)$ requires that the "12" appearing in $f_{X,Y}(x, y)$ be
factored in such a way that $g(x) \cdot h(y) = f_X(x) \cdot f_Y(y)$. Let

$$k = \int_{-\infty}^{\infty} h(y)\,dy = \int_0^1 y(1 - y)\,dy = \left[\frac{y^2}{2} - \frac{y^3}{3}\right]_0^1 = \frac{1}{6}$$

Therefore, $f_X(x) = kg(x) = \frac{1}{6}(12x) = 2x$, $0 \le x \le 1$, and
$f_Y(y) = (1/k)h(y) = 6y(1 - y)$, $0 \le y \le 1$.
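
The factoring step in Example 3.7.11 can be retraced numerically: integrate h(y) to get k, rescale g and h, and confirm that the product of the resulting marginals reproduces the joint pdf. The sketch below is illustrative only; the midpoint-rule integrator and its step count are assumptions rather than anything prescribed by the text.

```python
def joint_pdf(x, y):
    """Joint pdf of Example 3.7.11 on the unit square."""
    return 12.0 * x * y * (1.0 - y)


def g(x):
    return 12.0 * x


def h(y):
    return y * (1.0 - y)


def integrate(func, lower, upper, steps=10_000):
    """Midpoint-rule approximation to the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


k = integrate(h, 0.0, 1.0)  # should be close to 1/6


def f_x(x):
    return k * g(x)  # approximately 2x


def f_y(y):
    return h(y) / k  # approximately 6y(1 - y)


print(k, f_x(0.3), f_y(0.5))
print(joint_pdf(0.3, 0.5), f_x(0.3) * f_y(0.5))
```

Because k multiplies one factor and divides the other, the product of the two marginals matches the joint pdf regardless of small integration error in k.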


Independence of n (>2) Random Variables 


In Chapter 2, extending the notion of independence from two events to n events 
proved to be something of a problem. The independence of each subset of the n 
events had to be checked separately (recall Definition 2.5.2). This is not necessary 
in the case of n random variables. We simply use the extension of Theorem 3.7.4 to 
n random variables as the definition of independence in the multidimensional case. 
The theorem that independence is equivalent to the factorization of the joint pdf 
holds in the multidimensional case. 


Definition 3.7.6. The n random variables $X_1, X_2, \ldots, X_n$ are said to be
independent if there are functions $g_1(x_1), g_2(x_2), \ldots, g_n(x_n)$ such that for every
$x_1, x_2, \ldots, x_n$

$$f_{X_1,X_2,\ldots,X_n}(x_1, x_2, \ldots, x_n) = g_1(x_1)\,g_2(x_2) \cdots g_n(x_n)$$

A similar statement holds for discrete random variables, in which case f is
replaced with p.


Comment Analogous to the result for n = 2 random variables, the expression on the
right-hand side of the equation in Definition 3.7.6 can also be written as the product
of the marginal pdfs of $X_1, X_2, \ldots,$ and $X_n$.


Example 3.7.12

Consider k urns, each holding n chips numbered 1 through n. A chip is to be drawn
at random from each urn. What is the probability that all k chips will bear the same
number?

If $X_1, X_2, \ldots, X_k$ denote the numbers on the 1st, 2nd, ..., and kth chips,
respectively, we are looking for the probability that $X_1 = X_2 = \cdots = X_k$. In terms of
the joint pdf,

$$P(X_1 = X_2 = \cdots = X_k) = \sum_{x_1 = x_2 = \cdots = x_k} p_{X_1,X_2,\ldots,X_k}(x_1, x_2, \ldots, x_k)$$

Each of the selections here is obviously independent of all the others, so the joint
pdf factors according to Definition 3.7.6, and we can write

$$P(X_1 = X_2 = \cdots = X_k) = \sum_{i=1}^{n} p_{X_1}(i) \cdot p_{X_2}(i) \cdots p_{X_k}(i)$$

$$= n \cdot \left(\frac{1}{n} \cdot \frac{1}{n} \cdots \frac{1}{n}\right)$$

$$= \frac{1}{n^{k-1}}$$

Random Samples 


Definition 3.7.6 addresses the question of independence as it applies to n random
variables having marginal pdfs (say, $f_{X_1}(x_1), f_{X_2}(x_2), \ldots, f_{X_n}(x_n)$) that might be
quite different. A special case of that definition occurs for virtually every set of
data collected for statistical analysis. Suppose an experimenter takes a set of n
measurements, $x_1, x_2, \ldots, x_n$, under the same conditions. Those $X_i$'s, then, qualify as a
set of independent random variables; moreover, each represents the same pdf. The
special (but familiar) notation for that scenario is given in Definition 3.7.7. We will
encounter it often in the chapters ahead.


Definition 3.7.7. Let $X_1, X_2, \ldots, X_n$ be a set of n independent random variables,
all having the same pdf. Then $X_1, X_2, \ldots, X_n$ are said to be a random
sample of size n.
Questions


3.7.38. Two fair dice are tossed. Let X denote the number 
appearing on the first die and Y the number on the second. 
Show that X and Y are independent. 


3.7.39. Let $f_{X,Y}(x, y) = \lambda^2 e^{-\lambda(x+y)}$, $0 \le x$, $0 \le y$. Show that
X and Y are independent. What are the marginal pdfs in
this case?


3.7.40. Suppose that each of two urns has four chips, num- 
bered 1 through 4. A chip is drawn from the first urn and 
bears the number X. That chip is added to the second 
urn. A chip is then drawn from the second urn. Call its 
number Y. 


(a) Find $p_{X,Y}(x, y)$.
(b) Show that $p_X(k) = p_Y(k) = \frac{1}{4}$, k = 1, 2, 3, 4.
(c) Show that X and Y are not independent.


3.7.41. Let X and Y be random variables with joint pdf 
fxy@, =k, 


Give a geometric argument to show that X and Y are not 
independent. 


O<x<1, O<y<l, O<x+y<1 


3.7.42. Are the random variables X and Y independent if
$f_{X,Y}(x, y) = \frac{2}{3}(x + 2y)$, $0 < x < 1$, $0 < y < 1$?


3.7.43. Suppose that random variables X and Y are independent
with marginal pdfs $f_X(x) = 2x$, $0 < x < 1$, and
$f_Y(y) = 3y^2$, $0 < y < 1$. Find $P(Y < X)$.

3.7.44. Find the joint cdf of the independent random variables
X and Y, where $f_X(x) = \frac{x}{2}$, $0 < x < 2$, and $f_Y(y) = 2y$,
$0 < y < 1$.

3.7.45. If two random variables X and Y are independent
with marginal pdfs $f_X(x) = 2x$, $0 < x < 1$, and $f_Y(y) = 1$,
$0 < y < 1$, calculate $P\left(\frac{Y}{X} > 2\right)$.


3.7.46. Suppose $f_{X,Y}(x, y) = xye^{-(x+y)}$, $x > 0$, $y > 0$. Prove
for any real numbers a, b, c, and d that

\[
P(a < X < b,\; c < Y < d) = P(a < X < b) \cdot P(c < Y < d)
\]

thereby establishing the independence of X and Y.


3.7.47. Given the joint pdf $f_{X,Y}(x, y) = 2x + y - 2xy$, $0 < x < 1$,
$0 < y < 1$, find numbers a, b, c, and d such that

\[
P(a < X < b,\; c < Y < d) \ne P(a < X < b) \cdot P(c < Y < d)
\]

thus demonstrating that X and Y are not independent.


3.7.48. Prove that if X and Y are two independent random
variables, then U = g(X) and V = h(Y) are also
independent.


3.7.49. If two random variables X and Y are defined over 
a region in the X Y-plane that is not a rectangle (possibly 
infinite) with sides parallel to the coordinate axes, can X 
and Y be independent? 


3.7.50. Write down the joint probability density function
for a random sample of size n drawn from the exponential
pdf, $f_X(x) = (1/\lambda)e^{-x/\lambda}$, $x > 0$.

3.7.51. Suppose that $X_1$, $X_2$, $X_3$, and $X_4$ are independent
random variables, each with pdf $f_{X_i}(x_i) = 4x_i^3$, $0 < x_i < 1$.
Find

(a) $P\left(X_1 < \frac{1}{2}\right)$.
(b) $P\left(\text{exactly one } X_i < \frac{1}{2}\right)$.
(c) $f_{X_1,X_2,X_3,X_4}(x_1, x_2, x_3, x_4)$.
(d) $F_{X_2,X_3}(x_2, x_3)$.


3.7.52. A random sample of size n = 2k is taken from
a uniform pdf defined over the unit interval. Calculate
$P\left(X_1 < \frac{1}{2}, X_2 > \frac{1}{2}, X_3 < \frac{1}{2}, X_4 > \frac{1}{2}, \ldots, X_{2k} > \frac{1}{2}\right)$.


3.8 Transforming and Combining Random Variables


Transformations 


Transforming a variable from one scale to another is a problem that is comfortably 
familiar. If a thermometer says the temperature outside is 83°F, we know that the 
temperature in degrees Celsius is 28: 


\[
{}^{\circ}\mathrm{C} = \left(\frac{5}{9}\right)({}^{\circ}\mathrm{F} - 32) = \left(\frac{5}{9}\right)(83 - 32) = 28
\]


An analogous question arises in connection with random variables. Suppose that
X is a discrete random variable with pdf $p_X(k)$. If a second random variable, Y, is
defined to be aX + b, where a and b are constants, what can be said about the pdf
for Y?


Theorem 3.8.1. Suppose X is a discrete random variable. Let Y = aX + b, where a and b
are constants. Then

\[
p_Y(y) = p_X\left(\frac{y-b}{a}\right)
\]

Proof

\[
p_Y(y) = P(Y = y) = P(aX + b = y) = P\left(X = \frac{y-b}{a}\right) = p_X\left(\frac{y-b}{a}\right)
\]


Example 3.8.1. Let X be a random variable for which $p_X(k) = \frac{1}{10}$ for k = 1, 2, ..., 10. What is the
probability distribution associated with the random variable Y, where Y = 4X - 1?
That is, find $p_Y(y)$.

From Theorem 3.8.1, $P(Y = y) = P(4X - 1 = y) = P[X = (y+1)/4] = p_X\left(\frac{y+1}{4}\right)$,
which implies that $p_Y(y) = \frac{1}{10}$ for the ten values of (y + 1)/4 that equal 1, 2, ..., 10.
But (y + 1)/4 = 1 when y = 3, (y + 1)/4 = 2 when y = 7, ..., (y + 1)/4 = 10 when
y = 39. Therefore, $p_Y(y) = \frac{1}{10}$ for y = 3, 7, ..., 39.
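
The relabeling in Example 3.8.1 can be sketched in a few lines of Python. This is only an illustration: the function name `linear_transform_pmf` and the dictionary representation of a pdf are assumptions made here, not notation from the text.

```python
from fractions import Fraction

def linear_transform_pmf(p_x, a, b):
    """Return the pmf of Y = a*X + b, given the pmf of X as {value: probability}."""
    if a == 0:
        raise ValueError("a must be nonzero so that y = a*x + b can be solved for x")
    # p_Y(a*x + b) = p_X(x): each support point is relabeled, its mass unchanged.
    return {a * x + b: prob for x, prob in p_x.items()}

# Example 3.8.1: X uniform on 1, ..., 10 and Y = 4X - 1.
p_x = {k: Fraction(1, 10) for k in range(1, 11)}
p_y = linear_transform_pmf(p_x, 4, -1)
print(sorted(p_y))          # support of Y
print(set(p_y.values()))    # masses carried by the support points
```

Exact fractions are used so that the masses can be compared without rounding.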


Next we give the analogous result for a linear transformation of a continuous
random variable.


Theorem 3.8.2. Suppose X is a continuous random variable. Let Y = aX + b, where $a \ne 0$ and b is a
constant. Then

\[
f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y-b}{a}\right)
\]


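Proof We begin by writing an expression for the cdf of Y:

\[
F_Y(y) = P(Y \le y) = P(aX + b \le y) = P(aX \le y - b)
\]

At this point we need to consider two cases, the distinction being the sign of a.
Suppose, first, that a > 0. Then

\[
F_Y(y) = P(aX \le y - b) = P\left(X \le \frac{y-b}{a}\right)
\]

and differentiating $F_Y(y)$ yields $f_Y(y)$:

\[
f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} F_X\left(\frac{y-b}{a}\right) = \frac{1}{a} f_X\left(\frac{y-b}{a}\right) = \frac{1}{|a|} f_X\left(\frac{y-b}{a}\right)
\]

If a < 0,

\[
F_Y(y) = P(aX \le y - b) = P\left(X > \frac{y-b}{a}\right) = 1 - P\left(X \le \frac{y-b}{a}\right)
\]

Differentiation in this case gives

\[
f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy}\left[1 - F_X\left(\frac{y-b}{a}\right)\right] = -\frac{1}{a} f_X\left(\frac{y-b}{a}\right) = \frac{1}{|a|} f_X\left(\frac{y-b}{a}\right)
\]

and the theorem is proved.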
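
As a rough numerical check of Theorem 3.8.2, the sketch below compares the formula with a finite-difference derivative of the cdf of Y, once for a positive a and once for a negative a. The choice of a rate-1 exponential for X, the test points, and all function names are assumptions made only for this illustration.

```python
import math

def f_x(x):
    """Pdf of X (assumed here to be exponential with rate 1)."""
    return math.exp(-x) if x >= 0 else 0.0

def F_x(x):
    """Cdf of X."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def f_y(y, a, b):
    """Pdf of Y = aX + b as given by Theorem 3.8.2."""
    return f_x((y - b) / a) / abs(a)

def F_y(y, a, b):
    """Cdf of Y = aX + b, using the two cases (a > 0, a < 0) from the proof."""
    t = (y - b) / a
    return F_x(t) if a > 0 else 1.0 - F_x(t)

def slope(g, y, h=1e-5):
    """Central-difference approximation to g'(y)."""
    return (g(y + h) - g(y - h)) / (2 * h)

for a, b, y in [(2.0, 3.0, 5.0), (-2.0, 3.0, 1.0)]:
    numeric = slope(lambda t: F_y(t, a, b), y)
    print(a, b, y, numeric, f_y(y, a, b))
```

The two printed values in each row should agree to several decimal places; the test points are kept away from y = b, where the exponential pdf has a corner.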


Now, armed with the multivariable concepts and techniques covered in
Section 3.7, we can extend the investigation of transformations to functions defined
on sets of random variables. In statistics, the most important combination of a set of
random variables is often their sum, so we continue this section with the problem of
finding the pdf of X + Y.


Finding the Pdf of a Sum


Theorem 3.8.3. Suppose that X and Y are independent random variables. Let W = X + Y. Then

1. If X and Y are discrete random variables with pdfs $p_X(x)$ and $p_Y(y)$, respectively,

\[
p_W(w) = \sum_{\text{all } x} p_X(x)\, p_Y(w - x)
\]

2. If X and Y are continuous random variables with pdfs $f_X(x)$ and $f_Y(y)$,
respectively,

\[
f_W(w) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(w - x)\, dx
\]
Proof

1.
\[
\begin{aligned}
p_W(w) &= P(W = w) = P(X + Y = w) \\
&= P\Big(\bigcup_{\text{all } x} (X = x,\; Y = w - x)\Big) = \sum_{\text{all } x} P(X = x,\; Y = w - x) \\
&= \sum_{\text{all } x} P(X = x) \cdot P(Y = w - x) \\
&= \sum_{\text{all } x} p_X(x)\, p_Y(w - x)
\end{aligned}
\]

where the next-to-last equality derives from the independence of X and Y.

2. Since X and Y are continuous random variables, we can find $f_W(w)$ by
differentiating the corresponding cdf, $F_W(w)$. Here, $F_W(w) = P(X + Y \le w)$ is found
by integrating $f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)$ over the shaded region R, as pictured
in Figure 3.8.1.


Figure 3.8.1


By inspection,

\[
F_W(w) = \int_{-\infty}^{\infty} \int_{-\infty}^{w-x} f_X(x) f_Y(y)\, dy\, dx
= \int_{-\infty}^{\infty} f_X(x) \left[\int_{-\infty}^{w-x} f_Y(y)\, dy\right] dx
= \int_{-\infty}^{\infty} f_X(x)\, F_Y(w - x)\, dx
\]

Assume that the integrand in the above equation is sufficiently smooth so that
differentiation and integration can be interchanged. Then we can write

\[
f_W(w) = \frac{d}{dw} F_W(w) = \frac{d}{dw} \int_{-\infty}^{\infty} f_X(x)\, F_Y(w - x)\, dx
= \int_{-\infty}^{\infty} f_X(x) \left[\frac{d}{dw} F_Y(w - x)\right] dx
= \int_{-\infty}^{\infty} f_X(x)\, f_Y(w - x)\, dx
\]

and the theorem is proved.


Comment The integral in part (2) above is referred to as the convolution of
the functions $f_X$ and $f_Y$. Besides their frequent appearances in random variable
problems, convolutions turn up in many areas of mathematics and engineering.


Example 3.8.2. Suppose that X and Y are two independent binomial random variables, each with
the same success probability but defined on m and n trials, respectively. Specifically,

\[
p_X(k) = \binom{m}{k} p^k (1-p)^{m-k}, \quad k = 0, 1, \ldots, m
\]

and

\[
p_Y(k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n
\]

Find $p_W(w)$, where W = X + Y.
By Theorem 3.8.3, $p_W(w) = \sum_{\text{all } x} p_X(x)\, p_Y(w - x)$, but the summation over "all x"
needs to be interpreted as the set of values for x and w - x such that $p_X(x)$ and
$p_Y(w - x)$, respectively, are both nonzero. But that will be true for all integers x
from 0 to w. Therefore,

\[
\begin{aligned}
p_W(w) &= \sum_{x=0}^{w} p_X(x)\, p_Y(w - x)
= \sum_{x=0}^{w} \binom{m}{x} p^x (1-p)^{m-x} \binom{n}{w-x} p^{w-x} (1-p)^{n-(w-x)} \\
&= \sum_{x=0}^{w} \binom{m}{x} \binom{n}{w-x} p^w (1-p)^{n+m-w}
\end{aligned}
\]

Now, consider an urn having m red chips and n white chips. If w chips are
drawn out without replacement, the probability that exactly x red chips are in
the sample is given by the hypergeometric distribution,

\[
P(x \text{ reds in sample}) = \frac{\binom{m}{x}\binom{n}{w-x}}{\binom{m+n}{w}} \qquad (3.8.1)
\]

Summing Equation 3.8.1 from x = 0 to x = w must equal 1 (why?), in which case

\[
\sum_{x=0}^{w} \binom{m}{x} \binom{n}{w-x} = \binom{m+n}{w}
\]

so

\[
p_W(w) = \binom{m+n}{w} p^w (1-p)^{n+m-w}, \quad w = 0, 1, \ldots, n + m
\]

Should we recognize $p_W(w)$? Definitely. Compare the structure of $p_W(w)$ to the
statement of Theorem 3.2.1: The random variable W has a binomial distribution
where the probability of success at any given trial is p and the total number of trials
is n + m.
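
The conclusion of Example 3.8.2 can be confirmed exactly for particular values of m, n, and p by carrying out the discrete convolution of Theorem 3.8.3 directly. The sketch below is one way to do that; the helper names `binomial_pmf` and `convolve`, and the values m = 4, n = 6, p = 1/3, are chosen here purely for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """List of P(X = k), k = 0, ..., n, for a binomial random variable."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def convolve(p_x, p_y):
    """pmf of W = X + Y for independent X and Y supported on 0, 1, 2, ..."""
    p_w = [0] * (len(p_x) + len(p_y) - 1)
    for x, px in enumerate(p_x):
        for y, py in enumerate(p_y):
            p_w[x + y] += px * py      # accumulates p_X(x) * p_Y(w - x) at w = x + y
    return p_w

p = Fraction(1, 3)
p_w = convolve(binomial_pmf(4, p), binomial_pmf(6, p))
print(p_w == binomial_pmf(10, p))
```

Because the arithmetic is done with exact fractions, the comparison is an equality test rather than an approximate one.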


Comment Example 3.8.2 shows that the binomial distribution "reproduces"
itself; that is, if X and Y are independent binomial random variables with the same
value for p, their sum is also a binomial random variable. Not all random variables
share that property. The sum of two independent uniform random variables, for
example, is not a uniform random variable (see Question 3.8.3).


Example 3.8.3. Suppose a radiation monitor relies on an electronic sensor, whose lifetime X is
modeled by the exponential pdf, $f_X(x) = \lambda e^{-\lambda x}$, $x > 0$. To improve the reliability of the
monitor, the manufacturer has included an identical second sensor that is activated
only in the event the first sensor malfunctions. (This is called cold redundancy.)
Let the random variable Y denote the operating lifetime of the second sensor, in
which case the lifetime of the monitor can be written as the sum W = X + Y. Find
$f_W(w)$.


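Since X and Y are both continuous random variables,

\[
f_W(w) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(w - x)\, dx \qquad (3.8.2)
\]

Notice that $f_X(x) > 0$ only if $x > 0$ and that $f_Y(w - x) > 0$ only if $x < w$. Therefore,
the integral in Equation 3.8.2 that goes from $-\infty$ to $\infty$ reduces to an integral from
0 to w, and we can write

\[
\begin{aligned}
f_W(w) &= \int_0^w f_X(x)\, f_Y(w - x)\, dx = \int_0^w \lambda e^{-\lambda x}\, \lambda e^{-\lambda(w-x)}\, dx = \lambda^2 \int_0^w e^{-\lambda x} e^{-\lambda(w-x)}\, dx \\
&= \lambda^2 e^{-\lambda w} \int_0^w dx = \lambda^2 w e^{-\lambda w}, \quad w \ge 0
\end{aligned}
\]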


Comment By integrating $f_X(x)$ and $f_W(w)$, we can assess the improvement in the
monitor's reliability afforded by the cold redundancy. Since X is an exponential
random variable, $E(X) = 1/\lambda$ (recall Question 3.5.11). How different, for example, are
$P(X \ge 1/\lambda)$ and $P(W \ge 1/\lambda)$? A simple calculation shows that the latter is actually
twice the magnitude of the former:

\[
P(X \ge 1/\lambda) = \int_{1/\lambda}^{\infty} \lambda e^{-\lambda x}\, dx = -e^{-u}\Big|_1^{\infty} = e^{-1} = 0.37
\]

\[
P(W \ge 1/\lambda) = \int_{1/\lambda}^{\infty} \lambda^2 w e^{-\lambda w}\, dw = e^{-u}(-u - 1)\Big|_1^{\infty} = 2e^{-1} = 0.74
\]
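
The factor of two can also be checked numerically. The sketch below integrates the pdf of W found in Example 3.8.3 with a simple midpoint rule; the value of lambda, the step count, and the helper names are arbitrary choices made for this illustration only.

```python
import math

def f_w(w, lam):
    """Pdf of W = X + Y from Example 3.8.3."""
    return lam**2 * w * math.exp(-lam * w) if w >= 0 else 0.0

def midpoint_integral(g, lo, hi, steps=100_000):
    """Midpoint-rule approximation to the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))

lam = 2.5                                   # any positive rate gives the same answer
p_x = math.exp(-1.0)                        # P(X >= 1/lam), in closed form
p_w = 1.0 - midpoint_integral(lambda w: f_w(w, lam), 0.0, 1.0 / lam)
print(round(p_x, 2), round(p_w, 2), p_w / p_x)
```

Changing `lam` should leave both probabilities unchanged, since the threshold 1/lambda scales with the mean lifetime.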
1/a 


Finding the Pdfs of Quotients and Products 


We conclude this section by considering the pdfs for the quotient and product of two
independent random variables. That is, given X and Y, we are looking for $f_W(w)$,
where (1) W = Y/X and (2) W = XY. Neither of the resulting formulas is as
important as the pdf for the sum of two random variables, but both formulas will play key
roles in several derivations in Chapter 7.

Theorem 3.8.4. Let X and Y be independent continuous random variables, with pdfs $f_X(x)$ and $f_Y(y)$,
respectively. Assume that X is zero for at most a set of isolated points. Let W = Y/X.
Then

\[
f_W(w) = \int_{-\infty}^{\infty} |x|\, f_X(x)\, f_Y(wx)\, dx
\]
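Proof

\[
\begin{aligned}
F_W(w) &= P(Y/X \le w) \\
&= P(Y/X \le w \text{ and } X \ge 0) + P(Y/X \le w \text{ and } X < 0) \\
&= P(Y \le wX \text{ and } X \ge 0) + P(Y \ge wX \text{ and } X < 0) \\
&= P(Y \le wX \text{ and } X \ge 0) + P(X < 0) - P(Y \le wX \text{ and } X < 0) \\
&= \int_0^{\infty} \int_{-\infty}^{wx} f_X(x) f_Y(y)\, dy\, dx + P(X < 0) - \int_{-\infty}^{0} \int_{-\infty}^{wx} f_X(x) f_Y(y)\, dy\, dx
\end{aligned}
\]

The term P(X < 0) does not depend on w, so it vanishes when we differentiate $F_W(w)$ to obtain

\[
\begin{aligned}
f_W(w) = \frac{d}{dw} F_W(w) &= \frac{d}{dw} \int_0^{\infty} \int_{-\infty}^{wx} f_X(x) f_Y(y)\, dy\, dx - \frac{d}{dw} \int_{-\infty}^{0} \int_{-\infty}^{wx} f_X(x) f_Y(y)\, dy\, dx \\
&= \int_0^{\infty} f_X(x) \left(\frac{d}{dw} \int_{-\infty}^{wx} f_Y(y)\, dy\right) dx - \int_{-\infty}^{0} f_X(x) \left(\frac{d}{dw} \int_{-\infty}^{wx} f_Y(y)\, dy\right) dx
\end{aligned}
\qquad (3.8.3)
\]

(Note that we are assuming sufficient regularity of the functions to permit
interchange of integration and differentiation.)

To proceed, we need to differentiate the function $G(w) = \int_{-\infty}^{wx} f_Y(y)\, dy$ with
respect to w. By the Fundamental Theorem of Calculus and the chain rule, we find

\[
\frac{d}{dw} G(w) = \frac{d}{dw} \int_{-\infty}^{wx} f_Y(y)\, dy = f_Y(wx) \cdot \frac{d}{dw}(wx) = x f_Y(wx)
\]

Putting this result into Equation 3.8.3 gives

\[
\begin{aligned}
f_W(w) &= \int_0^{\infty} x f_X(x) f_Y(wx)\, dx - \int_{-\infty}^{0} x f_X(x) f_Y(wx)\, dx \\
&= \int_0^{\infty} x f_X(x) f_Y(wx)\, dx + \int_{-\infty}^{0} (-x) f_X(x) f_Y(wx)\, dx \\
&= \int_0^{\infty} |x| f_X(x) f_Y(wx)\, dx + \int_{-\infty}^{0} |x| f_X(x) f_Y(wx)\, dx \\
&= \int_{-\infty}^{\infty} |x| f_X(x) f_Y(wx)\, dx
\end{aligned}
\]

which completes the proof.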


Example 3.8.4. Let X and Y be independent random variables with pdfs $f_X(x) = \lambda e^{-\lambda x}$, $x > 0$, and
$f_Y(y) = \lambda e^{-\lambda y}$, $y > 0$, respectively. Define W = Y/X. Find $f_W(w)$.
Substituting into the formula given in Theorem 3.8.4, we can write

\[
\begin{aligned}
f_W(w) &= \int_0^{\infty} x (\lambda e^{-\lambda x})(\lambda e^{-\lambda x w})\, dx = \lambda^2 \int_0^{\infty} x e^{-\lambda(1+w)x}\, dx \\
&= \frac{\lambda^2}{\lambda(1+w)} \int_0^{\infty} x \lambda(1+w) e^{-\lambda(1+w)x}\, dx
\end{aligned}
\]

Notice that the integral is the expected value of an exponential random variable
with parameter $\lambda(1 + w)$, so it equals $1/[\lambda(1 + w)]$ (recall Example 3.5.6). Therefore,

\[
f_W(w) = \frac{\lambda^2}{\lambda(1+w)} \cdot \frac{1}{\lambda(1+w)} = \frac{1}{(1+w)^2}, \quad w \ge 0
\]
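
A direct numerical evaluation of the integral in Theorem 3.8.4 gives a quick sanity check on Example 3.8.4, including the fact that the answer does not depend on lambda. The sketch below uses a midpoint rule on a truncated range; the function name `quotient_pdf`, the truncation point, and the step count are assumptions made for this illustration.

```python
import math

def quotient_pdf(w, lam, steps=200_000):
    """Midpoint-rule value of the integral of x * f_X(x) * f_Y(w*x) over x > 0."""
    rate = lam * (1.0 + w)
    upper = 40.0 / rate              # beyond this point the integrand is negligible
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x * (lam * math.exp(-lam * x)) * (lam * math.exp(-lam * w * x))
    return total * h

for lam in (0.5, 3.0):
    print(lam, quotient_pdf(0.5, lam), 1.0 / (1.0 + 0.5) ** 2)
```

Both rows should show the same numerical value, matching the closed form on the right.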
Theorem 3.8.5. Let X and Y be independent continuous random variables with pdfs $f_X(x)$ and $f_Y(y)$,
respectively. Let W = XY. Then

\[
f_W(w) = \int_{-\infty}^{\infty} \frac{1}{|x|} f_X(x)\, f_Y(w/x)\, dx = \int_{-\infty}^{\infty} \frac{1}{|x|} f_X(w/x)\, f_Y(x)\, dx
\]

Proof A line-by-line, straightforward modification of the proof of Theorem 3.8.4
will provide a proof of Theorem 3.8.5. The details are left to the reader.
Example 3.8.5. Suppose that X and Y are independent random variables with pdfs $f_X(x) = 1$, $0 < x < 1$,
and $f_Y(y) = 2y$, $0 < y < 1$, respectively. Find $f_W(w)$, where W = XY.

According to Theorem 3.8.5,

\[
f_W(w) = \int_{-\infty}^{\infty} \frac{1}{|x|} f_X(x)\, f_Y(w/x)\, dx
\]

The region of integration, though, needs to be restricted to values of x for which the
integrand is positive. But $f_Y(w/x)$ is positive only if $0 < w/x < 1$, which implies that
$x > w$. Moreover, for $f_X(x)$ to be positive requires that $0 < x < 1$. Any x, then, from
w to 1 will yield a positive integrand. Therefore,

\[
f_W(w) = \int_w^1 \frac{1}{x}(1)(2w/x)\, dx = 2w \int_w^1 \frac{1}{x^2}\, dx = 2 - 2w, \quad 0 < w < 1
\]


Comment Theorems 3.8.3, 3.8.4, and 3.8.5 can be adapted to situations where X
and Y are not independent by replacing the product of the marginal pdfs with the
joint pdf.


Questions 


3.8.1. Let X and Y be two independent random vari- 
ables. Given the marginal pdfs shown below, find the pdf 
of X + Y. In each case, check to see if X + Y belongs to the 
same family of pdfs as do X and Y. 


(a) $p_X(k) = e^{-\lambda} \frac{\lambda^k}{k!}$ and $p_Y(k) = e^{-\mu} \frac{\mu^k}{k!}$, k = 0, 1, 2, ...


(b) $p_X(k) = p_Y(k) = (1 - p)^{k-1} p$, k = 1, 2, ...


3.8.2. Suppose $f_X(x) = xe^{-x}$, $x \ge 0$, and $f_Y(y) = e^{-y}$, $y \ge 0$,
where X and Y are independent. Find the pdf of X + Y.


3.8.3. Let X and Y be two independent random variables,
whose marginal pdfs are given below. Find the pdf of
X + Y. (Hint: Consider two cases, $0 < w < 1$ and $1 < w < 2$.)


$f_X(x) = 1$, $0 < x < 1$, and $f_Y(y) = 1$, $0 < y < 1$


3.8.4. If a random variable V is independent of two 
independent random variables X and Y, prove that V is 
independent of X+ Y. 


3.8.5. Let Y be a continuous nonnegative random variable.
Show that $W = Y^2$ has pdf $f_W(w) = \frac{1}{2\sqrt{w}} f_Y(\sqrt{w})$.
[Hint: First find $F_W(w)$.]

3.8.6. Let Y be a uniform random variable over the
interval [0, 1]. Find the pdf of $W = Y^2$.

3.8.7. Let Y be a random variable with $f_Y(y) = 6y(1 - y)$,
$0 < y < 1$. Find the pdf of $W = Y^2$.


3.8.8. Suppose the velocity of a gas molecule of mass m is
a random variable with pdf $f_Y(y) = ay^2 e^{-by^2}$, $y > 0$, where
a and b are positive constants depending on the gas. Find
the pdf of the kinetic energy, $W = (m/2)Y^2$, of such a
molecule.


3.8.9. Given that X and Y are independent random variables,
find the pdf of XY for the following two sets of
marginal pdfs:

(a) $f_X(x) = 1$, $0 < x < 1$, and $f_Y(y) = 1$, $0 < y < 1$
(b) $f_X(x) = 2x$, $0 < x < 1$, and $f_Y(y) = 2y$, $0 < y < 1$


3.8.10. Let X and Y be two independent random
variables. Given the marginal pdfs indicated below, find
the cdf of Y/X. (Hint: Consider two cases, $0 < w < 1$ and
$1 < w$.)

(a) $f_X(x) = 1$, $0 < x < 1$, and $f_Y(y) = 1$, $0 < y < 1$
(b) $f_X(x) = 2x$, $0 < x < 1$, and $f_Y(y) = 2y$, $0 < y < 1$


3.8.11. Suppose that X and Y are two independent random
variables, where $f_X(x) = xe^{-x}$, $x \ge 0$, and $f_Y(y) = e^{-y}$,
$y \ge 0$. Find the pdf of Y/X.

Sections 3.5 and 3.6 introduced the basic definitions related to the expected value
and variance of single random variables. We learned how to calculate E(W),
E[g(W)], E(aW + b), Var(W), and Var(aW + b), where a and b are any constants
and W could be either a discrete or a continuous random variable. The purpose of
this section is to examine certain multivariable extensions of those results, based on
the joint pdf material covered in Section 3.7.

We begin with a theorem that generalizes E[g(W)]. While it is stated here for 
the case of two random variables, it extends in a very straightforward way to include 
functions of n random variables. 


Theorem 3.9.1.

1. Suppose X and Y are discrete random variables with joint pdf $p_{X,Y}(x, y)$, and
let g(X, Y) be a function of X and Y. Then the expected value of the random
variable g(X, Y) is given by

\[
E[g(X, Y)] = \sum_{\text{all } x} \sum_{\text{all } y} g(x, y) \cdot p_{X,Y}(x, y)
\]

provided $\sum_{\text{all } x} \sum_{\text{all } y} |g(x, y)| \cdot p_{X,Y}(x, y) < \infty$.

2. Suppose X and Y are continuous random variables with joint pdf $f_{X,Y}(x, y)$,
and let g(X, Y) be a continuous function. Then the expected value of the random
variable g(X, Y) is given by

\[
E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) \cdot f_{X,Y}(x, y)\, dx\, dy
\]

provided $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} |g(x, y)| \cdot f_{X,Y}(x, y)\, dx\, dy < \infty$.


Proof The basic approach taken in deriving this result is similar to the method
followed in the proof of Theorem 3.5.3. See (128) for details.


Example 3.9.1. Consider the two random variables X and Y whose joint pdf is detailed in the 2 x 4
matrix shown in Table 3.9.1. Let

\[
g(X, Y) = 3X - 2XY + Y
\]

Find E[g(X, Y)] two ways: first, by using the basic definition of an expected value,
and second, by using Theorem 3.9.1.

Let Z = 3X - 2XY + Y. By inspection, Z takes on the values 0, 1, 2, and 3
according to the pdf $f_Z(z)$ shown in Table 3.9.2. Then from the basic definition
that an expected value is a weighted average, we see that E[g(X, Y)] is equal
to 1:

Table 3.9.1

                  Y
           0     1     2     3
X    0    1/8   1/4   1/8    0
     1     0    1/8   1/4   1/8


Table 3.9.2

z          0     1     2     3
f_Z(z)    1/4   1/2   1/4    0


\[
E[g(X, Y)] = E(Z) = \sum_{\text{all } z} z \cdot f_Z(z)
= 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} + 3 \cdot 0
= 1
\]

The same answer is obtained by applying Theorem 3.9.1 to the joint pdf given in
Table 3.9.1:

\[
E[g(X, Y)] = 0 \cdot \frac{1}{8} + 1 \cdot \frac{1}{4} + 2 \cdot \frac{1}{8} + 3 \cdot 0 + 3 \cdot 0 + 2 \cdot \frac{1}{8} + 1 \cdot \frac{1}{4} + 0 \cdot \frac{1}{8}
= 1
\]

The advantage, of course, enjoyed by the latter solution is that we avoid the
intermediate step of having to determine $f_Z(z)$.


Example 3.9.2. An electrical circuit has three resistors, $R_X$, $R_Y$, and $R_Z$, wired in parallel (see
Figure 3.9.1). The nominal resistance of each is fifteen ohms, but their actual
resistances, X, Y, and Z, vary between ten and twenty according to the joint pdf

\[
f_{X,Y,Z}(x, y, z) = \frac{1}{675{,}000}(xy + xz + yz), \quad 10 \le x \le 20, \; 10 \le y \le 20, \; 10 \le z \le 20
\]

What is the expected resistance for the circuit?


Figure 3.9.1


Let R denote the circuit's resistance. A well-known result in physics holds that

\[
\frac{1}{R} = \frac{1}{X} + \frac{1}{Y} + \frac{1}{Z}
\]

or, equivalently,

\[
R = \frac{XYZ}{XY + XZ + YZ} = R(X, Y, Z)
\]

Integrating $R(x, y, z) \cdot f_{X,Y,Z}(x, y, z)$ shows that the expected resistance is five:

\[
\begin{aligned}
E(R) &= \int_{10}^{20} \int_{10}^{20} \int_{10}^{20} \frac{xyz}{xy + xz + yz} \cdot \frac{1}{675{,}000}(xy + xz + yz)\, dx\, dy\, dz \\
&= \frac{1}{675{,}000} \int_{10}^{20} \int_{10}^{20} \int_{10}^{20} xyz\, dx\, dy\, dz \\
&= 5.0
\end{aligned}
\]
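
Both the normalizing constant 675,000 and the value E(R) = 5 reduce to products of one-dimensional integrals of 1 and t over [10, 20], so they can be verified with exact arithmetic. The sketch below does so; the helper name `power_integral` and the variable names are illustrative assumptions.

```python
from fractions import Fraction

def power_integral(k, lo=10, hi=20):
    """Exact value of the integral of t**k over [lo, hi]."""
    return Fraction(hi ** (k + 1) - lo ** (k + 1), k + 1)

length = power_integral(0)      # integral of 1 over [10, 20]
first = power_integral(1)       # integral of t over [10, 20]

# Each of xy, xz, and yz integrates to first * first * length over the cube.
normalizer = 3 * first * first * length
# R(x, y, z) * f(x, y, z) = xyz / normalizer, so E(R) = first**3 / normalizer.
expected_resistance = first ** 3 / normalizer
print(normalizer, expected_resistance)
```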
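Theorem 3.9.2. Let X and Y be any two random variables (discrete or continuous, dependent or
independent), and let a and b be any two constants. Then

\[
E(aX + bY) = aE(X) + bE(Y)
\]

provided E(X) and E(Y) are both finite.


Proof Consider the continuous case (the discrete case is proved much the same
way). Let $f_{X,Y}(x, y)$ be the joint pdf of X and Y, and define g(X, Y) = aX + bY.
By Theorem 3.9.1,

\[
\begin{aligned}
E(aX + bY) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (ax + by)\, f_{X,Y}(x, y)\, dx\, dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (ax)\, f_{X,Y}(x, y)\, dx\, dy + \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (by)\, f_{X,Y}(x, y)\, dx\, dy \\
&= a \int_{-\infty}^{\infty} x \left[\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy\right] dx + b \int_{-\infty}^{\infty} y \left[\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\right] dy \\
&= a \int_{-\infty}^{\infty} x f_X(x)\, dx + b \int_{-\infty}^{\infty} y f_Y(y)\, dy \\
&= aE(X) + bE(Y)
\end{aligned}
\]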


Corollary 3.9.1. Let $W_1, W_2, \ldots, W_n$ be any random variables for which $E(W_i) < \infty$, i = 1, 2, ..., n,
and let $a_1, a_2, \ldots, a_n$ be any set of constants. Then

\[
E(a_1 W_1 + a_2 W_2 + \cdots + a_n W_n) = a_1 E(W_1) + a_2 E(W_2) + \cdots + a_n E(W_n)
\]
Example 3.9.3. Let X be a binomial random variable defined on n independent trials, each trial
resulting in success with probability p. Find E(X).

Note, first, that X can be thought of as a sum, $X = X_1 + X_2 + \cdots + X_n$, where $X_i$
represents the number of successes occurring at the ith trial:

\[
X_i = \begin{cases} 1 & \text{if the } i\text{th trial produces a success} \\ 0 & \text{if the } i\text{th trial produces a failure} \end{cases}
\]

(Any $X_i$ defined in this way on an individual trial is called a Bernoulli random
variable. Every binomial random variable, then, can be thought of as the sum of
n independent Bernoullis.) By assumption, $p_{X_i}(1) = p$ and $p_{X_i}(0) = 1 - p$, i = 1,
2, ..., n. Using the corollary,

\[
E(X) = E(X_1) + E(X_2) + \cdots + E(X_n) = n \cdot E(X_1)
\]

the last step being a consequence of the $X_i$'s having identical distributions. But

\[
E(X_1) = 1 \cdot p + 0 \cdot (1 - p) = p
\]

so E(X) = np, which is what we found before (recall Theorem 3.5.1).


Comment The problem-solving implications of Theorem 3.9.2 and its corollary
should not be underestimated. There are many real-world events that can be
modeled as a linear combination $a_1 W_1 + a_2 W_2 + \cdots + a_n W_n$, where the $W_i$'s are relatively
simple random variables. Finding $E(a_1 W_1 + a_2 W_2 + \cdots + a_n W_n)$ directly may be
prohibitively difficult because of the inherent complexity of the linear combination. It
may very well be the case, though, that calculating the individual $E(W_i)$'s is easy.
Compare, for instance, Example 3.9.3 with Theorem 3.5.1. Both derive the formula
that E(X) = np when X is a binomial random variable. However, the approach
taken in Example 3.9.3 (i.e., using Theorem 3.9.2) is much easier. The next several
examples further explore the technique of using linear combinations to facilitate the
calculation of expected values.


Example 3.9.4. A disgruntled secretary is upset about having to stuff envelopes. Handed a box of
n letters and n envelopes, she vents her frustration by putting the letters into the
envelopes at random. How many people, on the average, will receive their correct
mail?

If X denotes the number of envelopes properly stuffed, what we want is E(X).
However, applying Definition 3.5.1 here would prove formidable because of the
difficulty in getting a workable expression for $p_X(k)$ [see (95)]. By using the corollary
to Theorem 3.9.2, though, we can solve the problem quite easily.

Let $X_i$ denote a random variable equal to the number of correct letters put into
the ith envelope, i = 1, 2, ..., n. Then $X_i$ equals 0 or 1, and

\[
p_{X_i}(k) = P(X_i = k) = \begin{cases} \dfrac{1}{n} & \text{for } k = 1 \\[2mm] \dfrac{n-1}{n} & \text{for } k = 0 \end{cases}
\]

But $X = X_1 + X_2 + \cdots + X_n$ and $E(X) = E(X_1) + E(X_2) + \cdots + E(X_n)$. Furthermore,
each of the $X_i$'s has the same expected value, 1/n:

\[
E(X_i) = \sum_{k=0}^{1} k \cdot P(X_i = k) = 0 \cdot \frac{n-1}{n} + 1 \cdot \frac{1}{n} = \frac{1}{n}
\]

It follows that

\[
E(X) = \sum_{i=1}^{n} E(X_i) = n \cdot \left(\frac{1}{n}\right) = 1
\]

showing that, regardless of n, the expected number of properly stuffed envelopes is
one. (Are the $X_i$'s independent? Does it matter?)
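
For small n the envelope problem can be settled by brute force: list every way of assigning the n letters to the n envelopes and average the number of matches. The sketch below does exactly that; the function name `expected_matches` is an illustrative assumption, and the enumeration is only practical for small n because the number of assignments is n!.

```python
from fractions import Fraction
from itertools import permutations

def expected_matches(n):
    """Average number of letters in their correct envelope over all n! assignments."""
    perms = list(permutations(range(n)))
    total = sum(
        sum(1 for envelope, letter in enumerate(perm) if envelope == letter)
        for perm in perms
    )
    return Fraction(total, len(perms))

for n in range(1, 8):
    print(n, expected_matches(n))
```

Each assignment is equally likely, so the plain average over all permutations is the expected value E(X).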


Example 3.9.5. Ten fair dice are rolled. Calculate the expected value of the sum of the faces showing.
If the random variable X denotes the sum of the faces showing on the ten dice,
then

\[
X = X_1 + X_2 + \cdots + X_{10}
\]

where $X_i$ is the number showing on the ith die, i = 1, 2, ..., 10. By assumption,
$p_{X_i}(k) = \frac{1}{6}$ for k = 1, 2, 3, 4, 5, 6, so $E(X_i) = \sum_{k=1}^{6} k \cdot \frac{1}{6} = \frac{1}{6} \sum_{k=1}^{6} k = \frac{1}{6} \cdot \frac{6(7)}{2} = 3.5$. By the
corollary to Theorem 3.9.2,

\[
E(X) = E(X_1) + E(X_2) + \cdots + E(X_{10}) = 10(3.5) = 35
\]

Notice that E(X) can also be deduced here by appealing to the notion that
expected values are centers of gravity. It should be clear from our work with
combinatorics that P(X = 10) = P(X = 60), P(X = 11) = P(X = 59), P(X = 12) = P(X = 58),
and so on. In other words, the probability function $p_X(k)$ is symmetric, which implies
that its center of gravity is the midpoint of the range of its X-values. It must be the
case, then, that E(X) equals $\frac{10 + 60}{2}$, or 35.
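
Both arguments in Example 3.9.5 can be checked against the exact pdf of the sum, which is easy to build by convolving the pdf of a single die with itself ten times. The following is a sketch; `sum_pmf` is a hypothetical helper written for this illustration.

```python
from fractions import Fraction

def sum_pmf(num_dice, faces=6):
    """Exact pmf of the sum of num_dice fair dice, built by repeated convolution."""
    pmf = {0: Fraction(1)}
    for _ in range(num_dice):
        nxt = {}
        for total, prob in pmf.items():
            for face in range(1, faces + 1):
                nxt[total + face] = nxt.get(total + face, 0) + prob / faces
        pmf = nxt
    return pmf

pmf = sum_pmf(10)
mean = sum(k * p for k, p in pmf.items())
print(mean, pmf[10] == pmf[60], pmf[11] == pmf[59])
```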


Example 3.9.6. The honor count in a (thirteen-card) bridge hand can vary from zero to thirty-seven
according to the formula:

\[
\text{honor count} = 4 \cdot (\text{number of aces}) + 3 \cdot (\text{number of kings}) + 2 \cdot (\text{number of queens}) + 1 \cdot (\text{number of jacks})
\]

What is the expected honor count of North's hand?

The solution here is a bit unusual in that we use the corollary to Theorem 3.9.2
backward. If $X_i$, i = 1, 2, 3, 4, denotes the honor count for players North, South, East,
and West, respectively, and if X denotes the analogous sum for the entire deck, we
can write

\[
X = X_1 + X_2 + X_3 + X_4
\]

But

\[
X = E(X) = 4 \cdot 4 + 3 \cdot 4 + 2 \cdot 4 + 1 \cdot 4 = 40
\]

By symmetry, $E(X_i) = E(X_j)$, $i \ne j$, so it follows that $40 = 4 \cdot E(X_1)$, which implies
that ten is the expected honor count of North's hand. (Try doing this problem
directly, without making use of the fact that the deck's honor count is forty.)


Expected Values of Products: A Special Case 
We know from Theorem 3.9.1 that for any two random variables X and Y,

\[
E(XY) = \begin{cases}
\displaystyle \sum_{\text{all } x} \sum_{\text{all } y} xy\, p_{X,Y}(x, y) & \text{if X and Y are discrete} \\[3mm]
\displaystyle \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f_{X,Y}(x, y)\, dx\, dy & \text{if X and Y are continuous}
\end{cases}
\]

If, however, X and Y are independent, there is an easier way to calculate E(XY).

Theorem 3.9.3. If X and Y are independent random variables,

\[
E(XY) = E(X) \cdot E(Y)
\]

provided E(X) and E(Y) both exist.


Proof Suppose X and Y are both discrete random variables. Then their joint pdf,
$p_{X,Y}(x, y)$, can be replaced by the product of their marginal pdfs, $p_X(x) \cdot p_Y(y)$, and
the double summation required by Theorem 3.9.1 can be written as the product of
two single summations:

\[
\begin{aligned}
E(XY) &= \sum_{\text{all } x} \sum_{\text{all } y} xy \cdot p_{X,Y}(x, y) \\
&= \sum_{\text{all } x} \sum_{\text{all } y} xy \cdot p_X(x) \cdot p_Y(y) \\
&= \sum_{\text{all } x} x \cdot p_X(x) \cdot \left[\sum_{\text{all } y} y \cdot p_Y(y)\right] \\
&= E(X) \cdot E(Y)
\end{aligned}
\]

The proof when X and Y are both continuous random variables is left as an
exercise.


Questions 


3.9.1. Suppose that r chips are drawn with replace- 
ment from an urn containing n chips, numbered 1 
through n. Let V denote the sum of the numbers drawn. 
Find E(V). 


3.9.2. Suppose that $f_{X,Y}(x, y) = \lambda^2 e^{-\lambda(x+y)}$, $0 \le x$, $0 \le y$.
Find E(X + Y).


3.9.3. Suppose that $f_{X,Y}(x, y) = \frac{2}{3}(x + 2y)$, $0 < x < 1$,
$0 < y < 1$ [recall Question 3.7.19(c)]. Find E(X + Y).


3.9.4. Marksmanship competition at a certain level 
requires each contestant to take ten shots with each of two 
different handguns. Final scores are computed by taking 
a weighted average of 4 times the number of bull’s-eyes 
made with the first gun plus 6 times the number gotten 
with the second. If Cathie has a 30% chance of hitting 
the bull’s-eye with each shot from the first gun and a 40% 
chance with each shot from the second gun, what is her 
expected score? 


3.9.5. Suppose that $X_i$ is a random variable for which
$E(X_i) = \mu$, i = 1, 2, ..., n. Under what conditions will the
following be true?

\[
E\left(\sum_{i=1}^{n} a_i X_i\right) = \mu
\]

3.9.6. Suppose that the daily closing price of stock goes up 
an eighth of a point with probability p and down an eighth 
of a point with probability g, where p > q. After n days 
how much gain can we expect the stock to have achieved? 
Assume that the daily price fluctuations are independent 
events. 


3.9.7. An urn contains r red balls and w white balls. A 
sample of n balls is drawn in order and without replace- 
ment. Let X; be 1 if the ith draw is red and 0 otherwise, 
i=1,2,...,n. 


(a) Show that $E(X_i) = E(X_1)$, i = 2, 3, ..., n.

(b) Use the corollary to Theorem 3.9.2 to show
that the expected number of red balls is
nr/(r + w).


3.9.8. Suppose two fair dice are tossed. Find the expected 
value of the product of the faces showing. 


3.9.9. Find E(R) for a two-resistor circuit similar to the
one described in Example 3.9.2, where $f_{X,Y}(x, y) = k(x + y)$,
$10 < x < 20$, $10 < y < 20$.


3.9.10. Suppose that X and Y are both uniformly distributed
over the interval [0, 1]. Calculate the expected
value of the square of the distance of the random point
(X, Y) from the origin; that is, find $E(X^2 + Y^2)$. (Hint: See
Question 3.8.6.)


3.9.11. Suppose X represents a point picked at random
from the interval [0, 1] on the x-axis, and Y is a point
picked at random from the interval [0, 1] on the y-axis.
Assume that X and Y are independent. What is the
expected value of the area of the triangle formed by the
points (X, 0), (0, Y), and (0, 0)?

3.9.12. Suppose $Y_1, Y_2, \ldots, Y_n$ is a random sample from
the uniform pdf over [0, 1]. The geometric mean of the
numbers is the random variable $\sqrt[n]{Y_1 Y_2 \cdots Y_n}$. Compare
the expected value of the geometric mean to that of the
arithmetic mean $\bar{Y}$.


Calculating the Variance of a Sum of Random Variables 


When random variables are not independent, a measure of the relationship between 
them, their covariance, enters into the picture. 


Definition 3.9.1. Given random variables X and Y with finite variances, define
the covariance of X and Y, written Cov(X, Y), as

\[
\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)
\]


Theorem 3.9.4. If X and Y are independent, then Cov(X, Y) = 0.

Proof If X and Y are independent, by Theorem 3.9.3, E(XY) = E(X)E(Y). Then

\[
\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) = E(X)E(Y) - E(X)E(Y) = 0
\]


The converse of Theorem 3.9.4 is not true. Just because Cov(X, Y) = 0, we cannot
conclude that X and Y are independent. Example 3.9.7 is a case in point.


Example 3.9.7. Consider the sample space S = {(-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4)}, where each
point is assumed to be equally likely. Define the random variable X to be the first
component of a sample point and Y, the second. Then X(-2, 4) = -2, Y(-2, 4) = 4,
and so on.

Notice that X and Y are dependent:

\[
\frac{1}{5} = P(X = 1, Y = 1) \ne P(X = 1) \cdot P(Y = 1) = \frac{1}{5} \cdot \frac{2}{5} = \frac{2}{25}
\]

However, the covariance of X and Y is zero:

\[
E(XY) = [(-8) + (-1) + 0 + 1 + 8] \cdot \frac{1}{5} = 0
\]

\[
E(X) = [(-2) + (-1) + 0 + 1 + 2] \cdot \frac{1}{5} = 0
\]

and

\[
E(Y) = (4 + 1 + 0 + 1 + 4) \cdot \frac{1}{5} = 2
\]

so

\[
\mathrm{Cov}(X, Y) = E(XY) - E(X) \cdot E(Y) = 0 - 0 \cdot 2 = 0
\]
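
The calculations in Example 3.9.7 are short enough to reproduce with exact arithmetic. The sketch below computes the covariance and compares the joint probability P(X = 1, Y = 1) with the product of the marginals; the helper `expect` and the variable names are illustrative assumptions.

```python
from fractions import Fraction

points = [(-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4)]
prob = Fraction(1, len(points))          # each sample point is equally likely

def expect(g):
    """Expected value of g(X, Y) over the five equally likely points."""
    return sum(g(x, y) * prob for x, y in points)

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)
joint = expect(lambda x, y: 1 if (x, y) == (1, 1) else 0)        # P(X = 1, Y = 1)
product = (expect(lambda x, y: 1 if x == 1 else 0)
           * expect(lambda x, y: 1 if y == 1 else 0))             # P(X = 1) * P(Y = 1)
print(cov, joint, product)
```

A zero covariance together with unequal `joint` and `product` is exactly the situation the example describes.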


Theorem 3.9.5 demonstrates the role of the covariance in finding the variance 
of a sum of random variables that are not necessarily independent. 


Theorem 3.9.5. Suppose X and Y are random variables with finite variances, and a and b are
constants. Then

\[
\mathrm{Var}(aX + bY) = a^2 \mathrm{Var}(X) + b^2 \mathrm{Var}(Y) + 2ab\, \mathrm{Cov}(X, Y)
\]

Proof For convenience, denote E(X) by $\mu_X$ and E(Y) by $\mu_Y$. Then $E(aX + bY) = a\mu_X + b\mu_Y$ and

\[
\begin{aligned}
\mathrm{Var}(aX + bY) &= E[(aX + bY)^2] - (a\mu_X + b\mu_Y)^2 \\
&= E(a^2 X^2 + b^2 Y^2 + 2abXY) - (a^2 \mu_X^2 + b^2 \mu_Y^2 + 2ab\mu_X \mu_Y) \\
&= [E(a^2 X^2) - a^2 \mu_X^2] + [E(b^2 Y^2) - b^2 \mu_Y^2] + [2abE(XY) - 2ab\mu_X \mu_Y] \\
&= a^2 [E(X^2) - \mu_X^2] + b^2 [E(Y^2) - \mu_Y^2] + 2ab[E(XY) - \mu_X \mu_Y] \\
&= a^2 \mathrm{Var}(X) + b^2 \mathrm{Var}(Y) + 2ab\, \mathrm{Cov}(X, Y)
\end{aligned}
\]


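Example 3.9.8. For the joint pdf $f_{X,Y}(x, y) = x + y$, $0 \le x \le 1$, $0 \le y \le 1$, find the variance of X + Y.
Since X and Y are not independent,

\[
\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)
\]

The pdf is symmetric in X and Y, so Var(X) = Var(Y), and we can write Var(X + Y) =
2[Var(X) + Cov(X, Y)].
To calculate Var(X), the marginal pdf of X is needed. But

\[
f_X(x) = \int_0^1 (x + y)\, dy = x + \frac{1}{2}
\]

\[
\mu_X = \int_0^1 x\left(x + \frac{1}{2}\right) dx = \int_0^1 \left(x^2 + \frac{x}{2}\right) dx = \frac{7}{12}
\]

\[
E(X^2) = \int_0^1 x^2 \left(x + \frac{1}{2}\right) dx = \int_0^1 \left(x^3 + \frac{x^2}{2}\right) dx = \frac{5}{12}
\]

\[
\mathrm{Var}(X) = E(X^2) - \mu_X^2 = \frac{5}{12} - \left(\frac{7}{12}\right)^2 = \frac{11}{144}
\]

Then

\[
E(XY) = \int_0^1 \int_0^1 xy(x + y)\, dy\, dx = \int_0^1 \left(\frac{x^2}{2} + \frac{x}{3}\right) dx = \left(\frac{x^3}{6} + \frac{x^2}{6}\right)\Bigg|_0^1 = \frac{1}{3}
\]

so, putting all of the pieces together,

\[
\mathrm{Cov}(X, Y) = 1/3 - (7/12)(7/12) = -1/144
\]

and, finally, Var(X + Y) = 2[11/144 + (-1/144)] = 5/36.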
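
Every moment needed in Example 3.9.8 is an integral of a polynomial over the unit square, so the arithmetic can be checked exactly. In the sketch below, the closed form inside `moment` comes from integrating $x^a y^b (x + y)$ term by term; the function name and variable names are assumptions made for this illustration.

```python
from fractions import Fraction

def moment(a, b):
    """E(X**a * Y**b) for the joint pdf f(x, y) = x + y on the unit square.

    The integral of x**(a+1) * y**b is 1/((a+2)(b+1)) and the integral of
    x**a * y**(b+1) is 1/((a+1)(b+2)); the moment is their sum.
    """
    return Fraction(1, (a + 2) * (b + 1)) + Fraction(1, (a + 1) * (b + 2))

mu_x, mu_y = moment(1, 0), moment(0, 1)
var_x = moment(2, 0) - mu_x ** 2
var_y = moment(0, 2) - mu_y ** 2
cov_xy = moment(1, 1) - mu_x * mu_y
var_sum = var_x + var_y + 2 * cov_xy      # Theorem 3.9.5 with a = b = 1
print(mu_x, var_x, cov_xy, var_sum)
```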


The two corollaries that follow are straightforward extensions of Theorem 3.9.5
to n variables. The details of the proof will be left as an exercise.


Corollary Suppose that $W_1, W_2, \ldots, W_n$ are random variables with finite variances. Then

\[
\mathrm{Var}\left(\sum_{i=1}^{n} a_i W_i\right) = \sum_{i=1}^{n} a_i^2\, \mathrm{Var}(W_i) + 2 \sum_{i<j} a_i a_j\, \mathrm{Cov}(W_i, W_j)
\]

Corollary Suppose that $W_1, W_2, \ldots, W_n$ are independent random variables with finite variances.
Then

\[
\mathrm{Var}(W_1 + W_2 + \cdots + W_n) = \mathrm{Var}(W_1) + \mathrm{Var}(W_2) + \cdots + \mathrm{Var}(W_n)
\]

More discussion of the covariance and its role in measuring the relationship between
random variables occurs in Section 11.4.


Example 3.9.9. The binomial random variable, being a sum of n independent Bernoullis, is an
obvious candidate for the corollary to Theorem 3.9.5 on the sum of independent random
variables. Let $X_i$ denote the number of successes occurring on the ith trial. Then

\[
X_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}
\]

and

\[
X = X_1 + X_2 + \cdots + X_n = \text{total number of successes in n trials}
\]

Find Var(X).
Note that

\[
E(X_i) = 1 \cdot p + 0 \cdot (1 - p) = p
\]

and

\[
E(X_i^2) = (1)^2 \cdot p + (0)^2 \cdot (1 - p) = p
\]

so

\[
\mathrm{Var}(X_i) = E(X_i^2) - [E(X_i)]^2 = p - p^2 = p(1 - p)
\]

It follows, then, that the variance of a binomial random variable is np(1 - p):

\[
\mathrm{Var}(X) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = np(1 - p)
\]


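Example 3.9.10. Recall the hypergeometric model: an urn contains N chips, r red and w white
(r + w = N); a random sample of size n is selected without replacement and the random
variable X is defined to be the number of red chips in the sample. As in the previous
example, write X as a sum of simple random variables.

\[
X_i = \begin{cases} 1 & \text{if the } i\text{th chip drawn is red} \\ 0 & \text{otherwise} \end{cases}
\]

Then $X = X_1 + X_2 + \cdots + X_n$. Clearly,

\[
E(X_i) = 1 \cdot \frac{r}{N} + 0 \cdot \frac{w}{N} = \frac{r}{N}
\]

and $E(X) = n\left(\frac{r}{N}\right) = np$, where $p = \frac{r}{N}$.
Since $X_i^2 = X_i$, $E(X_i^2) = E(X_i) = \frac{r}{N}$ and

\[
\mathrm{Var}(X_i) = E(X_i^2) - [E(X_i)]^2 = \frac{r}{N} - \left(\frac{r}{N}\right)^2 = p(1 - p)
\]

Also, for any $j \ne k$,

\[
\begin{aligned}
\mathrm{Cov}(X_j, X_k) &= E(X_j X_k) - E(X_j)E(X_k) \\
&= 1 \cdot P(X_j = 1, X_k = 1) - \left(\frac{r}{N}\right)^2 \\
&= \frac{r}{N} \cdot \frac{r-1}{N-1} - \frac{r^2}{N^2} = -\frac{r}{N} \cdot \frac{N-r}{N} \cdot \frac{1}{N-1}
\end{aligned}
\]

From the first corollary to Theorem 3.9.5, then,

\[
\begin{aligned}
\mathrm{Var}(X) &= \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2 \sum_{j<k} \mathrm{Cov}(X_j, X_k) \\
&= np(1 - p) - 2\binom{n}{2} p(1 - p) \cdot \frac{1}{N-1} \\
&= p(1 - p)\left[n - \frac{n(n-1)}{N-1}\right] \\
&= np(1 - p) \cdot \frac{N-n}{N-1}
\end{aligned}
\]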
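
The variance formula at the end of Example 3.9.10 can be compared with a direct computation from the hypergeometric pdf. The sketch below does this with exact fractions; `hypergeom_moments` and the particular values N = 20, r = 7, n = 5 are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def hypergeom_moments(N, r, n):
    """Exact mean and variance of the number of red chips in a sample of size n."""
    lo, hi = max(0, n - (N - r)), min(r, n)
    pmf = {x: Fraction(comb(r, x) * comb(N - r, n - x), comb(N, n))
           for x in range(lo, hi + 1)}
    mean = sum(x * p for x, p in pmf.items())
    var = sum(x * x * p for x, p in pmf.items()) - mean ** 2
    return mean, var

N, r, n = 20, 7, 5
p = Fraction(r, N)
mean, var = hypergeom_moments(N, r, n)
print(mean == n * p, var == n * p * (1 - p) * Fraction(N - n, N - 1))
```

The factor (N - n)/(N - 1) is what distinguishes this variance from the binomial value np(1 - p) of Example 3.9.9.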


Example 
3.9.11 


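In statistics, it is often necessary to draw inferences based on $\bar{W}$, the average computed from a random sample of n observations. Two properties of $\bar{W}$ are especially important. First, if the $W_i$'s come from a population where the mean is $\mu$, the corollary to Theorem 3.9.2 implies that $E(\bar{W}) = \mu$. Second, if the $W_i$'s come from a population whose variance is $\sigma^2$, then $\mathrm{Var}(\bar{W}) = \sigma^2/n$. To verify the latter, we can appeal again to Theorem 3.9.5. Write

\[ \bar{W} = \frac{1}{n}\sum_{i=1}^{n} W_i = \frac{1}{n} W_1 + \frac{1}{n} W_2 + \cdots + \frac{1}{n} W_n \]

Then

\[ \begin{aligned} \mathrm{Var}(\bar{W}) &= \left(\frac{1}{n}\right)^2 \mathrm{Var}(W_1) + \left(\frac{1}{n}\right)^2 \mathrm{Var}(W_2) + \cdots + \left(\frac{1}{n}\right)^2 \mathrm{Var}(W_n) \\ &= \left(\frac{1}{n}\right)^2 \sigma^2 + \left(\frac{1}{n}\right)^2 \sigma^2 + \cdots + \left(\frac{1}{n}\right)^2 \sigma^2 \\ &= \frac{\sigma^2}{n} \end{aligned} \]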
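The relation $\mathrm{Var}(\bar{W}) = \sigma^2/n$ can be confirmed exactly for a small discrete population by enumerating every possible sample. The sketch below assumes, purely for illustration, that each observation is the face of a fair die; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def variance(dist):
    """Variance of a finite distribution given as (value, probability) pairs."""
    mean = sum(v * q for v, q in dist)
    return sum((v - mean) ** 2 * q for v, q in dist)


def variance_of_sample_mean(n):
    """Exact variance of the mean of n independent fair-die rolls."""
    q = Fraction(1, 6**n)  # every ordered sample is equally likely
    dist = [(Fraction(sum(s), n), q) for s in product(range(1, 7), repeat=n)]
    return variance(dist)


sigma2 = variance([(Fraction(face), Fraction(1, 6)) for face in range(1, 7)])
print(sigma2, variance_of_sample_mean(3), sigma2 / 3)
```

The enumeration relies on the rolls being independent, which is the assumption that lets the covariance terms in Theorem 3.9.5 drop out.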


Questions 


3.9.13. Suppose that two dice are thrown. Let X be the 
number showing on the first die and let Y be the larger of 
the two numbers showing. Find Cov(X, Y). 


3.9.14. Show that 
Cov(aX +b, cY +d) =acCov(X, Y) 
for any constants a, b,c, and d. 


3.9.15. Let U be a random variable uniformly distributed over $[0, 2\pi]$. Define $X = \cos U$ and $Y = \sin U$. Show that X and Y are dependent but that Cov(X, Y) = 0.


3.9.16. Let X and Y be random variables with
\[ f_{X,Y}(x, y) = \begin{cases} 1, & -y < x < y, \; 0 < y < 1 \\ 0, & \text{elsewhere} \end{cases} \]
Show that Cov(X, Y) = 0 but that X and Y are dependent.


3.9.17. Suppose that $f_{X,Y}(x, y) = \lambda^2 e^{-\lambda(x+y)}$, $0 \le x$, $0 \le y$. Find Var(X + Y). (Hint: See Questions 3.6.11 and 3.9.2.)


3.9.18. Suppose that $f_{X,Y}(x, y) = \frac{2}{3}(x + 2y)$, $0 \le x \le 1$, $0 \le y \le 1$. Find Var(X + Y). (Hint: See Question 3.9.3.)


3.9.19. For the uniform pdf defined over [0, 1], find the 
variance of the geometric mean when n = 2 (see Ques- 
tion 3.9.12). 


3.9.20. Let X be a binomial random variable based on n trials and a success probability of $p_X$; let Y be an independent binomial random variable based on m trials and a success probability of $p_Y$. Find E(W) and Var(W), where $W = 4X + 6Y$.


3.9.21. Let the Poisson random variable U (see p. 227) be 
the number of calls for technical assistance received by a 
computer company during the firm’s nine normal work- 
day hours. Suppose the average number of calls per hour 
is 7.0 and that each call costs the company $50. Let V be a 
Poisson random variable representing the number of calls 
for technical assistance received during a day’s remaining 


fifteen hours. Suppose the average number of calls per 
hour is 4.0 for that time period and that each such call costs 
the company $60. Find the expected cost and the vari- 
ance of the cost associated with the calls received during a 
twenty-four-hour day. 


3.9.22. A mason is contracted to build a patio retaining 
wall. Plans call for the base of the wall to be a row of 
fifty 10-inch bricks, each separated by }-inch-thick mortar. 
Suppose that the bricks used are randomly chosen from a 
population of bricks whose mean length is 10 inches and 
whose standard deviation is = inch. Also, suppose that the 
mason, on the average, will make the mortar } inch thick, 
but that the actual dimension will vary from brick to brick, 
the standard deviation of the thicknesses being + inch. 
What is the standard deviation of L, the length of the first
row of the wall? What assumption are you making? 


3.9.23. An electric circuit has six resistors wired in series, 
each nominally being five ohms. What is the maximum 
standard deviation that can be allowed in the manufac- 
ture of these resistors if the combined circuit resistance is 
to have a standard deviation no greater than 0.4 ohm? 


3.9.24. A gambler plays n hands of poker. If he wins the 
kth hand, he collects k dollars; if he loses the kth hand, 
he collects nothing. Let T denote his total winnings in n 
hands. Assuming that his chances of winning each hand 
are constant and independent of his success or failure at 
any other hand, find E(T) and Var(T).


3.10 Order Statistics 


Example 
3.10.1 


The single-variable transformation taken up in Section 3.4 involved a standard linear 
operation, Y =aX + b. The bivariate transformations in Section 3.8 were similarly 
arithmetic, typically being concerned with either sums or products. In this section 
we will consider a different sort of transformation, one involving the ordering of 
an entire set of random variables. This particular transformation has wide applica- 
bility in many areas of statistics, and we will see some of its consequences in later 
chapters. 


Definition 3.10.1. Let Y be a continuous random variable for which $y_1, y_2, \ldots, y_n$ are the values of a random sample of size n. Reorder the $y_i$'s from smallest to largest:

\[ y'_1 < y'_2 < \cdots < y'_n \]

(No two of the $y_i$'s are equal, except with probability zero, since Y is continuous.) Define the random variable $Y'_i$ to have the value $y'_i$, $1 \le i \le n$. Then $Y'_i$ is called the ith order statistic. Sometimes $Y'_n$ and $Y'_1$ are denoted $Y_{\max}$ and $Y_{\min}$, respectively.


Suppose that four measurements are made on the random variable Y: $y_1 = 3.4$, $y_2 = 4.6$, $y_3 = 2.6$, and $y_4 = 3.2$. The corresponding ordered sample would be

\[ 2.6 < 3.2 < 3.4 < 4.6 \]

The random variable representing the smallest observation would be denoted $Y'_1$, with its value for this particular sample being 2.6. Similarly, the value for the second order statistic, $Y'_2$, is 3.2, and so on.


The Distribution of Extreme Order Statistics 


By definition, every observation in a random sample has the same pdf. For example, if a set of four measurements is taken from a normal distribution with $\mu = 80$ and $\sigma = 15$, then $f_{Y_1}(y)$, $f_{Y_2}(y)$, $f_{Y_3}(y)$, and $f_{Y_4}(y)$ are all the same: each is a normal
Theorem
3.10.1 


Example 
3.10.2 


pdf with $\mu = 80$ and $\sigma = 15$. The pdf describing an ordered observation, though, is not the same as the pdf describing a random observation. Intuitively, that makes sense. If a single observation is drawn from a normal distribution with $\mu = 80$ and $\sigma = 15$, it would not be surprising if that observation were to take on a value near 80. On the other hand, if a random sample of n = 100 observations is drawn from that same distribution, we would not expect the smallest observation (that is, $Y_{\min}$) to be anywhere near 80. Common sense tells us that that smallest observation is likely to be much smaller than 80, just as the largest observation, $Y_{\max}$, is likely to be much larger than 80.

It follows, then, that before we can do any probability calculations (or any applications whatsoever) involving order statistics, we need to know the pdf of $Y'_i$ for $i = 1, 2, \ldots, n$. We begin by investigating the pdfs of the "extreme" order statistics, $f_{Y_{\max}}(y)$ and $f_{Y_{\min}}(y)$. These are the simplest to work with. At the end of the section we return to the more general problems of finding (1) the pdf of $Y'_i$ for any i and (2) the joint pdf of $Y'_i$ and $Y'_j$, where $i < j$.


Suppose that $Y_1, Y_2, \ldots, Y_n$ is a random sample of continuous random variables, each having pdf $f_Y(y)$ and cdf $F_Y(y)$. Then

a. The pdf of the largest order statistic is
\[ f_{Y_{\max}}(y) = f_{Y'_n}(y) = n[F_Y(y)]^{n-1} f_Y(y) \]
b. The pdf of the smallest order statistic is
\[ f_{Y_{\min}}(y) = f_{Y'_1}(y) = n[1 - F_Y(y)]^{n-1} f_Y(y) \]


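Proof Finding the pdfs of $Y_{\max}$ and $Y_{\min}$ is accomplished by using the now-familiar technique of differentiating a random variable's cdf. Consider, for example, the case of the largest order statistic, $Y'_n$:

\[ \begin{aligned} F_{Y'_n}(y) = F_{Y_{\max}}(y) &= P(Y_{\max} \le y) \\ &= P(Y_1 \le y, Y_2 \le y, \ldots, Y_n \le y) \\ &= P(Y_1 \le y) \cdot P(Y_2 \le y) \cdots P(Y_n \le y) \quad \text{(why?)} \\ &= [F_Y(y)]^n \end{aligned} \]
Therefore,
\[ f_{Y'_n}(y) = \frac{d}{dy}[F_Y(y)]^n = n[F_Y(y)]^{n-1} f_Y(y) \]
Similarly, for the smallest order statistic (i = 1),
\[ \begin{aligned} F_{Y'_1}(y) = F_{Y_{\min}}(y) &= P(Y_{\min} \le y) \\ &= 1 - P(Y_{\min} > y) = 1 - P(Y_1 > y) \cdot P(Y_2 > y) \cdots P(Y_n > y) \\ &= 1 - [1 - F_Y(y)]^n \end{aligned} \]

Therefore,

\[ f_{Y'_1}(y) = \frac{d}{dy}\left[1 - [1 - F_Y(y)]^n\right] = n[1 - F_Y(y)]^{n-1} f_Y(y) \]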


Suppose a random sample of n = 3 observations, $Y_1$, $Y_2$, and $Y_3$, is taken from the exponential pdf, $f_Y(y) = e^{-y}$, $y \ge 0$. Compare $f_{Y_1}(y)$ with $f_{Y'_1}(y)$. Intuitively, which will be larger, $P(Y_1 < 1)$ or $P(Y'_1 < 1)$?
The pdf for $Y_1$, of course, is just the pdf of the distribution being sampled, that is,

\[ f_{Y_1}(y) = f_Y(y) = e^{-y}, \quad y \ge 0 \]

To find the pdf for $Y'_1$ requires that we apply the formula given in the proof of Theorem 3.10.1 for $f_{Y'_1}(y)$. Note, first of all, that

\[ F_Y(y) = \int_0^y e^{-t}\,dt = -e^{-t}\Big|_0^y = 1 - e^{-y} \]
Then, since n = 3 (and i = 1), we can write

\[ f_{Y'_1}(y) = 3[1 - (1 - e^{-y})]^2 e^{-y} = 3e^{-3y}, \quad y \ge 0 \]


[Figure 3.10.1: the pdfs $f_{Y'_1}(y) = 3e^{-3y}$ and $f_{Y_1}(y) = e^{-y}$ plotted on the same axes; vertical axis: probability density.]


Figure 3.10.1 shows the two pdfs plotted on the same set of axes. Compared to $f_{Y_1}(y)$, the pdf for $Y'_1$ has more of its area located above the smaller values of y (where $Y'_1$ is more likely to lie). For example, the probability that the smallest observation (out of three) is less than 1 is 95%, while the probability that a random observation is less than 1 is only 63%:

\[ P(Y'_1 < 1) = \int_0^1 3e^{-3y}\,dy = \int_0^3 e^{-u}\,du = -e^{-u}\Big|_0^3 = 1 - e^{-3} = 0.95 \]
\[ P(Y_1 < 1) = \int_0^1 e^{-y}\,dy = -e^{-y}\Big|_0^1 = 1 - e^{-1} = 0.63 \]
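The cdfs derived in the proof of Theorem 3.10.1, $[F_Y(y)]^n$ for the maximum and $1 - [1 - F_Y(y)]^n$ for the minimum, make comparisons like this one easy to reproduce. The short sketch below is illustrative only; `cdf_max` and `cdf_min` are hypothetical helper names that take the value of $F_Y(y)$ at the point of interest.

```python
import math


def cdf_max(F_y, n):
    """P(Y_max <= y) for n iid observations, given F_y = F_Y(y)."""
    return F_y**n


def cdf_min(F_y, n):
    """P(Y_min <= y) for n iid observations, given F_y = F_Y(y)."""
    return 1 - (1 - F_y) ** n


F_at_1 = 1 - math.exp(-1.0)      # exponential cdf F_Y(1) = 1 - e^{-1}
p_single = F_at_1                # P(Y_1 < 1)
p_smallest = cdf_min(F_at_1, 3)  # P(Y'_1 < 1) with n = 3
print(round(p_single, 2), round(p_smallest, 2))
```

Because only $F_Y(y)$ enters these formulas, the same two functions apply to any continuous distribution once its cdf has been evaluated.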
Example 3.10.3

Suppose a random sample of size 10 is drawn from a continuous pdf $f_Y(y)$. What is the probability that the largest observation, $Y'_{10}$, is less than the pdf's median, m?

Using the formula for $f_{Y'_{10}}(y) = f_{Y_{\max}}(y)$ given in the proof of Theorem 3.10.1, it is certainly true that

\[ P(Y'_{10} < m) = \int_{-\infty}^{m} 10 f_Y(y)[F_Y(y)]^9\,dy \tag{3.10.1} \]

but the problem does not specify $f_Y(y)$, so Equation 3.10.1 is of no help.
Example


3.10.4 


Theorem 
3.10.2 


Fortunately, a much simpler solution is available, even if $f_Y(y)$ were specified: The event "$Y'_{10} < m$" is equivalent to the event "$Y_1 < m \cap Y_2 < m \cap \cdots \cap Y_{10} < m$." Therefore,

\[ P(Y'_{10} < m) = P(Y_1 < m, Y_2 < m, \ldots, Y_{10} < m) \tag{3.10.2} \]

But the ten observations here are independent, so the intersection probability implicit on the right-hand side of Equation 3.10.2 factors into a product of ten terms. Moreover, each of those terms equals $\frac{1}{2}$ (by definition of the median), so

\[ P(Y'_{10} < m) = P(Y_1 < m) \cdot P(Y_2 < m) \cdots P(Y_{10} < m) = \left(\frac{1}{2}\right)^{10} = 0.00098 \]


To find order statistics for discrete pdfs, probability arguments of the type used in the proof of Theorem 3.10.1 can be employed. The example of finding the pdf of $X_{\min}$ for the discrete density function $p_X(k)$, $k = 0, 1, 2, \ldots$, suffices to demonstrate this point.

Given a random sample $X_1, X_2, \ldots, X_n$ from $p_X(k)$, choose an arbitrary nonnegative integer m. Recall that the cdf in this case is given by $F_X(m) = \sum_{k=0}^{m} p_X(k)$. Consider the events
\[ A = (m \le X_1 \cap m \le X_2 \cap \cdots \cap m \le X_n) \quad \text{and} \quad B = (m + 1 \le X_1 \cap m + 1 \le X_2 \cap \cdots \cap m + 1 \le X_n) \]

Then $p_{X_{\min}}(m) = P(A \cap B^C) = P(A) - P(A \cap B) = P(A) - P(B)$, where $A \cap B = B$, since $B \subset A$.

Now $P(A) = P(m \le X_1) \cdot P(m \le X_2) \cdots P(m \le X_n) = [1 - F_X(m - 1)]^n$ by the independence of the $X_i$. Similarly, $P(B) = [1 - F_X(m)]^n$, so

\[ p_{X_{\min}}(m) = [1 - F_X(m - 1)]^n - [1 - F_X(m)]^n \]
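The formula for $p_{X_{\min}}(m)$ translates almost line for line into code. The sketch below is an illustration under assumptions: the pmf is passed as a dictionary, and a fair die is used as the example distribution (its support starts at 1 rather than 0, which does not affect the argument). The name `pmf_of_minimum` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def pmf_of_minimum(pmf, n):
    """pmf of X_min for n iid draws; pmf maps integer value -> probability."""

    def tail(m):
        # P(X >= m), which equals 1 - F_X(m - 1)
        return sum(q for k, q in pmf.items() if k >= m)

    # P(X_min = m) = P(all X_i >= m) - P(all X_i >= m + 1)
    return {m: tail(m) ** n - tail(m + 1) ** n for m in pmf}


die = {k: Fraction(1, 6) for k in range(1, 7)}  # assumed example pmf
min_pmf = pmf_of_minimum(die, 2)
print(min_pmf[1], min_pmf[6])
```

For two dice, the smallest face equals 1 unless both dice show 2 or more, which is where the value 1 - (5/6)^2 comes from.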


A General Formula for $f_{Y'_i}(y)$


Having discussed two special cases of order statistics, $Y_{\min}$ and $Y_{\max}$, we now turn to the more general problem of finding the pdf for the ith order statistic, where i can be any integer from 1 through n.


Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of continuous random variables drawn from a distribution having pdf $f_Y(y)$ and cdf $F_Y(y)$. The pdf of the ith order statistic is given by

\[ f_{Y'_i}(y) = \frac{n!}{(i-1)!(n-i)!}[F_Y(y)]^{i-1}[1 - F_Y(y)]^{n-i} f_Y(y) \]
for $1 \le i \le n$.

Proof We will give a heuristic argument that draws on the similarity between the statement of Theorem 3.10.2 and the binomial distribution. For a formal induction proof verifying the expression given for $f_{Y'_i}(y)$, see (97).

Recall the derivation of the binomial probability function, $p_X(k) = P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$, where X is the number of successes in n independent trials, and p


Example 
3.10.5 
is the probability that any given trial ends in success. Central to that derivation was the recognition that the event "X = k" is actually a union of all the different (mutually exclusive) sequences having exactly k successes and n - k failures. Because the trials are independent, the probability of any such sequence is $p^k(1 - p)^{n-k}$ and the number of such sequences (by Theorem 2.6.2) is $n!/[k!(n - k)!]$ (or $\binom{n}{k}$), so the probability that X = k is the product $\binom{n}{k} p^k (1 - p)^{n-k}$.

Here we are looking for the pdf of the ith order statistic at some point y, that is, $f_{Y'_i}(y)$. As was the case with the binomial, that pdf will reduce to a combinatorial term times the probability associated with an intersection of independent events. The only fundamental difference is that $Y'_i$ is a continuous random variable, whereas the binomial X is discrete, which means that what we find here will be a probability density function.


[Figure 3.10.2: along the Y-axis, $i - 1$ observations lie below the point $y$, one observation lies at $y$, and $n - i$ observations lie above $y$.]


By Theorem 2.6.2, there are $n!/[(i - 1)!\,1!\,(n - i)!]$ ways that n observations can be parceled into three groups such that the ith smallest is at the point y (see Figure 3.10.2). Moreover, the likelihood associated with any particular set of points having the configuration pictured in Figure 3.10.2 will be the probability that i - 1 (independent) observations are all less than y, n - i observations are greater than y, and one observation is at y. The probability density associated with those constraints for a given set of points would be $[F_Y(y)]^{i-1}[1 - F_Y(y)]^{n-i} f_Y(y)$. The probability density, then, that the ith order statistic is located at the point y is the product

\[ f_{Y'_i}(y) = \frac{n!}{(i-1)!(n-i)!}[F_Y(y)]^{i-1}[1 - F_Y(y)]^{n-i} f_Y(y) \]


Suppose that many years of observation have confirmed that the annual maximum flood tide Y (in feet) for a certain river can be modeled by the pdf

\[ f_Y(y) = \frac{1}{20}, \quad 20 < y < 40 \]


(Note: It is unlikely that flood tides would be described by anything as simple as a 
uniform pdf. We are making that choice here solely to facilitate the mathematics.) 
The Army Corps of Engineers is planning to build a levee along a certain portion 
of the river, and they want to make it high enough so that there is only a 30% 
chance that the second worst flood in the next thirty-three years will overflow the 
embankment. How high should the levee be? (We assume that there will be only 
one potential flood per year.) 

Let h be the desired height. If $Y_1, Y_2, \ldots, Y_{33}$ denote the flood tides for the next n = 33 years, what we require of h is that

\[ P(Y'_{32} > h) = 0.30 \]
As a starting point, notice that for $20 < y < 40$,

\[ F_Y(y) = \int_{20}^{y} \frac{1}{20}\,dy = \frac{y}{20} - 1 \]
Figure 3.10.3


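Therefore,
\[ f_{Y'_{32}}(y) = \frac{33!}{31!\,1!}\left(\frac{y}{20} - 1\right)^{31}\left(2 - \frac{y}{20}\right)^{1} \cdot \frac{1}{20} \]
and h is the solution of the integral equation
\[ \int_h^{40} (33)(32)\left(\frac{y}{20} - 1\right)^{31}\left(2 - \frac{y}{20}\right)\frac{dy}{20} = 0.30 \tag{3.10.3} \]
If we make the substitution
\[ u = \frac{y}{20} - 1 \]
Equation 3.10.3 simplifies to
\[ P(Y'_{32} > h) = 33(32)\int_{(h/20)-1}^{1} u^{31}(1 - u)\,du = 1 - 33\left(\frac{h}{20} - 1\right)^{32} + 32\left(\frac{h}{20} - 1\right)^{33} \tag{3.10.4} \]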

Setting the right-hand side of Equation 3.10.4 equal to 0.30 and solving for h by trial 
and error gives 


h = 39.3 feet
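The "trial and error" step can be automated with bisection, since the right-hand side of Equation 3.10.4 decreases steadily from 1 at h = 20 to 0 at h = 40. The sketch below is one possible way to do this; the function names and the tolerance are assumptions made for illustration.

```python
def prob_overflow(h):
    """Right-hand side of Equation 3.10.4: P(Y'_32 > h) for 20 <= h <= 40."""
    a = h / 20 - 1
    return 1 - 33 * a**32 + 32 * a**33


def solve_height(target=0.30, lo=20.0, hi=40.0, tol=1e-9):
    """Bisection for the h at which prob_overflow(h) equals target."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_overflow(mid) > target:
            lo = mid  # overflow still too likely: the levee must be higher
        else:
            hi = mid
    return (lo + hi) / 2


h = solve_height()
print(round(h, 1))
```

The computed root agrees with the text's value of about 39.3 feet to the precision quoted there.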


Joint Pdfs of Order Statistics 


Finding the joint pdf of two or more order statistics is easily accomplished by generalizing the argument that derived from Figure 3.10.2. Suppose, for example, that each of n observations in a random sample has pdf $f_Y(y)$ and cdf $F_Y(y)$. The joint pdf for order statistics $Y'_i$ and $Y'_j$ at points u and v, where $i < j$ and $u < v$, can be deduced from Figure 3.10.3, which shows how the n points must be distributed if the ith and jth order statistics are to be located at points u and v, respectively.


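[Diagram: along the Y-axis, $i - 1$ observations lie below $u$, $Y'_i$ is at $u$, $j - i - 1$ observations lie between $u$ and $v$, $Y'_j$ is at $v$, and $n - j$ observations lie above $v$.]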


By Theorem 2.6.2, the number of ways to divide a set of n observations into groups of sizes $i - 1$, 1, $j - i - 1$, 1, and $n - j$ is the quotient

\[ \frac{n!}{(i-1)!\,1!\,(j-i-1)!\,1!\,(n-j)!} \]
Also, given the independence of the n observations, the probability that $i - 1$ are less than u is $[F_Y(u)]^{i-1}$, the probability that $j - i - 1$ are between u and v is $[F_Y(v) - F_Y(u)]^{j-i-1}$, and the probability that $n - j$ are greater than v is $[1 - F_Y(v)]^{n-j}$. Multiplying, then, by the pdfs describing the likelihoods that $Y'_i$ and $Y'_j$ would be at points u and v, respectively, gives the joint pdf of the two order statistics:

\[ f_{Y'_i, Y'_j}(u, v) = \frac{n!}{(i-1)!(j-i-1)!(n-j)!}[F_Y(u)]^{i-1}[F_Y(v) - F_Y(u)]^{j-i-1}[1 - F_Y(v)]^{n-j} f_Y(u) f_Y(v) \tag{3.10.5} \]

for $i < j$ and $u < v$.
Example 3.10.6

Let $Y_1$, $Y_2$, and $Y_3$ be a random sample of size n = 3 from the uniform pdf defined over the unit interval, $f_Y(y) = 1$, $0 \le y \le 1$. By definition, the range, R, of a sample is the difference between the largest and smallest order statistics, in this case,

\[ R = \text{range} = Y_{\max} - Y_{\min} = Y'_3 - Y'_1 \]

Find $f_R(r)$, the pdf for the range.

We will begin by finding the joint pdf of $Y'_1$ and $Y'_3$. Then $f_{Y'_1, Y'_3}(u, v)$ is integrated over the region $Y'_3 - Y'_1 \le r$ to find the cdf, $F_R(r) = P(R \le r)$. The final step is to differentiate the cdf and make use of the fact that $f_R(r) = F'_R(r)$.

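If $f_Y(y) = 1$, $0 \le y \le 1$, it follows that

\[ F_Y(y) = P(Y \le y) = \begin{cases} 0, & y < 0 \\ y, & 0 \le y \le 1 \\ 1, & y > 1 \end{cases} \]

Applying Equation 3.10.5, then, with n = 3, i = 1, and j = 3, gives the joint pdf of $Y'_1$ and $Y'_3$. Specifically,

\[ f_{Y'_1, Y'_3}(u, v) = \frac{3!}{0!\,1!\,0!}\,u^0 (v - u)^1 (1 - v)^0 \cdot 1 \cdot 1 = 6(v - u), \quad 0 \le u < v \le 1 \]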


Moreover, we can write the cdf for R in terms of $Y'_1$ and $Y'_3$:
\[ F_R(r) = P(R \le r) = P(Y'_3 - Y'_1 \le r) = P(Y'_3 \le Y'_1 + r) \]

Figure 3.10.4 shows the region in the $Y'_1 Y'_3$-plane corresponding to the event that $R \le r$. Integrating the joint pdf of $Y'_1$ and $Y'_3$ over the shaded region gives

\[ F_R(r) = P(R \le r) = \int_0^{1-r}\int_u^{u+r} 6(v - u)\,dv\,du + \int_{1-r}^{1}\int_u^{1} 6(v - u)\,dv\,du \]

[Figure 3.10.4: the shaded region of the $Y'_1 Y'_3$-plane (u-axis shown) corresponding to the event $R \le r$.]

The first double integral equals $3r^2 - 3r^3$; the second equals $r^3$. Therefore,
\[ F_R(r) = 3r^2 - 3r^3 + r^3 = 3r^2 - 2r^3 \]
which implies that

\[ f_R(r) = F'_R(r) = 6r - 6r^2, \quad 0 \le r \le 1 \]
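A quick simulation offers an independent check on the cdf $F_R(r) = 3r^2 - 2r^3$. The sketch below is illustrative rather than part of the derivation: it draws three uniform observations many times and records how often the range is at most r. The seed, the number of trials, and the function names are assumptions.

```python
import random


def range_cdf(r):
    """F_R(r) = 3r^2 - 2r^3 for the range of three uniform(0, 1) observations."""
    return 3 * r**2 - 2 * r**3


def simulate_range_cdf(r, trials=100_000, seed=0):
    """Monte Carlo estimate of P(R <= r) with a fixed seed for repeatability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(3)]
        if max(sample) - min(sample) <= r:
            hits += 1
    return hits / trials


for r in (0.25, 0.5, 0.75):
    print(r, range_cdf(r), simulate_range_cdf(r))
```

The simulated proportions should sit close to the exact values, with discrepancies on the order of the Monte Carlo sampling error.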
3.10.1. Suppose the length of time, in minutes, that you
have to wait at a bank teller’s window is uniformly dis- 
tributed over the interval (0, 10). If you go to the bank 
four times during the next month, what is the probabil- 
ity that your second longest wait will be less than five 
minutes? 


3.10.2. A random sample of size n = 6 is taken from the 
pdf fy(y) =3y?, 0<y <1. Find P(¥! > 0.75). 


3.10.3. What is the probability that the larger of two ran- 
dom observations drawn from any continuous pdf will 
exceed the sixtieth percentile? 


3.10.4. A random sample of size 5 is drawn from the pdf 
fr(y) =2y, O< y <1. Calculate P(Y/ < 0.6 < Y2). (Hint: 
Consider the complement.) 


3.10.5. Suppose that Y,, Y2,..., ¥, is a random sample of 
size n drawn from a continuous pdf, f(y), whose median 
is m. Is P(Y/ > m) less than, equal to, or greater than 
P(Y/>m)? 


3.10.6. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from the exponential pdf $f_Y(y) = e^{-y}$, $y \ge 0$. What is the smallest n for which $P(Y_{\min} < 0.2) > 0.9$?


3.10.7. Calculate P(0.6 < Y; < 0.7) if a random sample 
of size 6 is drawn from the uniform pdf defined over the 
interval [0, 1]. 


3.10.8. A random sample of size n = 5 is drawn from the 
pdf fy(y) =2y, 0< y <1. On the same set of axes, graph 
the pdfs for Y>, Y/, and Y.. 


3.10.9. Suppose that n observations are taken at random 
from the pdf 
f= ee WY 
= ———_e , —oO<y<0oo 
"Tix ©) 
What is the probability that the smallest observation is 
larger than twenty? 


3.10.10. Suppose that n observations are chosen at ran- 
dom from a continuous pdf f(y). What is the probability 
that the last observation recorded will be the smallest 
number in the entire sample? 


3.10.11. In a certain large metropolitan area, the pro- 
portion, Y, of students bused varies widely from school 
to school. The distribution of proportions is roughly 
described by the following pdf: 


fy) 


y 
0 1 


Suppose the enrollment figures for five schools selected 
at random are examined. What is the probability that the 
school with the fourth highest proportion of bused chil- 
dren will have a Y value in excess of 0.75? What is the 
probability that none of the schools will have fewer than 
10% of their students bused? 


3.10.12. Consider a system containing n components, 
where the lifetimes of the components are indepen- 
dent random variables and each has pdf fy(y) = Ae’, 
y > 0. Show that the average time elapsing before the first 
component failure occurs is 1/nd. 


3.10.13. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a uniform pdf over [0, 1]. Use Theorem 3.10.2 to show that
\[ \int_0^1 y^{i-1}(1 - y)^{n-i}\,dy = \frac{(i-1)!(n-i)!}{n!} \]


3.10.14. Use Question 3.10.13 to find the expected value 
of Y/, where Y;, Y2, ..., Y, is a random sample from a 
uniform pdf defined over the interval [0, 1]. 


3.10.15. Suppose three points are picked randomly from 
the unit interval. What is the probability that the three are 
within a half unit of one another? 


3.10.16. Suppose a device has three independent components, all of whose lifetimes (in months) are modeled by the exponential pdf, $f_Y(y) = e^{-y}$, $y > 0$. What is the probability that all three components will fail within two months of one another?


3.11 Conditional Densities 


We have already seen that many of the concepts defined in Chapter 2 relating to the 
probabilities of events—for example, independence —have random variable coun- 
terparts. Another of these carryovers is the notion of a conditional probability, or, 
in what will be our present terminology, a conditional probability density function. 
Applications of conditional pdfs are not uncommon. The height and girth of a tree, 


Example 
3.11.1 
for instance, can be considered a pair of random variables. While it is easy to measure girth, it can be difficult to determine height; thus it might be of interest to a
lumberman to know the probabilities of a ponderosa pine’s attaining certain heights 
given a known value for its girth. Or consider the plight of a school board member 
agonizing over which way to vote on a proposed budget increase. Her task would be 
that much easier if she knew the conditional probability that x additional tax dol- 
lars would stimulate an average increase of y points among twelfth graders taking a 
standardized proficiency exam. 


Finding Conditional Pdfs for Discrete Random Variables 


In the case of discrete random variables, a conditional pdf can be treated in the 
same way as a conditional probability. Note the similarity between Definitions 3.11.1 
and 2.4.1. 


Definition 3.11.1. Let X and Y be discrete random variables. The conditional probability density function of Y given x (that is, the probability that Y takes on the value y given that X is equal to x) is denoted $p_{Y|x}(y)$ and given by

\[ p_{Y|x}(y) = P(Y = y \mid X = x) = \frac{p_{X,Y}(x, y)}{p_X(x)} \]

for $p_X(x) \ne 0$.


A fair coin is tossed five times. Let the random variable Y denote the total number of heads that occur, and let X denote the number of heads occurring on the last two tosses. Find the conditional pdf $p_{Y|x}(y)$ for all x and y.

Clearly, there will be three different conditional pdfs, one for each possible value of X (x = 0, x = 1, and x = 2). Moreover, for each value of x there will be four possible values of Y, based on whether the first three tosses yield zero, one, two, or three heads.

For example, suppose no heads occur on the last two tosses. Then X = 0, and

\[ \begin{aligned} p_{Y|0}(y) = P(Y = y \mid X = 0) &= P(y \text{ heads occur on first three tosses}) \\ &= \binom{3}{y}\left(\frac{1}{2}\right)^{y}\left(\frac{1}{2}\right)^{3-y} \\ &= \binom{3}{y}\left(\frac{1}{2}\right)^{3}, \quad y = 0, 1, 2, 3 \end{aligned} \]

Now, suppose that X = 1. The corresponding conditional pdf in that case becomes

\[ p_{Y|1}(y) = P(Y = y \mid X = 1) \]

Notice that Y = 1 if zero heads occur in the first three tosses, Y = 2 if one head occurs in the first three trials, and so on. Therefore,

\[ p_{Y|1}(y) = \binom{3}{y-1}\left(\frac{1}{2}\right)^{y-1}\left(\frac{1}{2}\right)^{3-(y-1)} = \binom{3}{y-1}\left(\frac{1}{2}\right)^{3}, \quad y = 1, 2, 3, 4 \]

Similarly,

\[ p_{Y|2}(y) = P(Y = y \mid X = 2) = \binom{3}{y-2}\left(\frac{1}{2}\right)^{3}, \quad y = 2, 3, 4, 5 \]


Figure 3.11.1 shows the three conditional pdfs. Each has the same shape, but the possible values of Y are different for each value of X.

[Figure 3.11.1: the conditional pdfs $p_{Y|0}(y)$, $p_{Y|1}(y)$, and $p_{Y|2}(y)$, one panel for each of x = 0, x = 1, and x = 2, plotted against the Y-axis.]
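Definition 3.11.1 can be applied mechanically by enumerating the 32 equally likely toss sequences, tabulating the joint pdf of X and Y, and dividing by the marginal pdf of X. The sketch below is only an illustration of that bookkeeping; the variable and function names are assumptions.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product
from math import comb

joint = defaultdict(Fraction)       # joint[(x, y)] = P(X = x, Y = y)
marginal_x = defaultdict(Fraction)  # marginal_x[x] = P(X = x)

for tosses in product((0, 1), repeat=5):  # 1 = head, 0 = tail
    x = sum(tosses[3:])  # heads on the last two tosses
    y = sum(tosses)      # heads on all five tosses
    joint[(x, y)] += Fraction(1, 32)
    marginal_x[x] += Fraction(1, 32)


def conditional(y, x):
    """p_{Y|x}(y) = p_{X,Y}(x, y) / p_X(x)."""
    return joint[(x, y)] / marginal_x[x]


for x in range(3):
    print(x, [conditional(y, x) for y in range(x, x + 4)])
```

Each row printed lists the four conditional probabilities for one value of x, which can be compared with $\binom{3}{y-x}\left(\frac{1}{2}\right)^3$.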
Example 3.11.2

Assume that the probabilistic behavior of a pair of discrete random variables X and Y is described by the joint pdf

\[ p_{X,Y}(x, y) = xy^2/39 \]

defined over the four points (1, 2), (1, 3), (2, 2), and (2, 3). Find the conditional probability that X = 1 given that Y = 2.

By definition,
\[ \begin{aligned} p_{X|2}(1) &= P(X = 1 \text{ given that } Y = 2) \\ &= \frac{p_{X,Y}(1, 2)}{p_Y(2)} \\ &= \frac{1 \cdot 2^2/39}{1 \cdot 2^2/39 + 2 \cdot 2^2/39} \\ &= 1/3 \end{aligned} \]
Example 3.11.3

Suppose that X and Y are two independent binomial random variables, each defined on n trials and each having the same success probability p. Let Z = X + Y. Show that the conditional pdf $p_{X|z}(x)$ is a hypergeometric distribution.

We know from Example 3.8.2 that Z has a binomial distribution with parameters 2n and p. That is,

\[ p_Z(z) = P(Z = z) = \binom{2n}{z} p^z (1 - p)^{2n-z}, \quad z = 0, 1, \ldots, 2n \]

By Definition 3.11.1,

\[ \begin{aligned} p_{X|z}(x) = \frac{p_{X,Z}(x, z)}{p_Z(z)} &= \frac{P(X = x \text{ and } Z = z)}{P(Z = z)} \\ &= \frac{P(X = x \text{ and } Y = z - x)}{P(Z = z)} \\ &= \frac{P(X = x) \cdot P(Y = z - x)}{P(Z = z)} \quad \text{(because X and Y are independent)} \\ &= \frac{\binom{n}{x} p^x (1 - p)^{n-x} \cdot \binom{n}{z-x} p^{z-x} (1 - p)^{n-(z-x)}}{\binom{2n}{z} p^z (1 - p)^{2n-z}} \\ &= \frac{\binom{n}{x}\binom{n}{z-x}}{\binom{2n}{z}} \end{aligned} \]

which we recognize as being the hypergeometric distribution.
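The cancellation of p in the last step can be verified with exact arithmetic: the conditional pdf should match the hypergeometric expression no matter which success probability is used. The sketch below assumes n = 4 and z = 5 purely for illustration, and its function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p, k):
    """P(X = k) for a binomial(n, p) random variable; zero outside 0..n."""
    if k < 0 or k > n:
        return Fraction(0)
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def conditional_x_given_z(n, p, x, z):
    """p_{X|z}(x) = P(X = x) P(Y = z - x) / P(Z = z), with Z binomial(2n, p)."""
    return binomial_pmf(n, p, x) * binomial_pmf(n, p, z - x) / binomial_pmf(2 * n, p, z)


def hypergeometric(n, x, z):
    """C(n, x) C(n, z - x) / C(2n, z); zero when x or z - x is out of range."""
    if x < 0 or x > n or z - x < 0 or z - x > n:
        return Fraction(0)
    return Fraction(comb(n, x) * comb(n, z - x), comb(2 * n, z))


n, z = 4, 5  # assumed example values
for p in (Fraction(1, 3), Fraction(3, 4)):
    print([conditional_x_given_z(n, p, x, z) for x in range(n + 1)])
```

Both values of p should produce the same list, which is the sense in which the conditional distribution no longer depends on p.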
Comment The notion of a conditional pdf generalizes easily to situations involving more than two discrete random variables. For example, if X, Y, and Z have the joint pdf $p_{X,Y,Z}(x, y, z)$, the joint conditional pdf of, say, X and Y given that Z = z is the ratio

\[ p_{X,Y|z}(x, y) = \frac{p_{X,Y,Z}(x, y, z)}{p_Z(z)} \]


Example 3.11.4

Suppose that random variables X, Y, and Z have the joint pdf

\[ p_{X,Y,Z}(x, y, z) = xy/9z \]

for points (1, 1, 1), (2, 1, 2), (1, 2, 2), (2, 2, 2), and (2, 2, 1). Find $p_{X,Y|z}(x, y)$ for all values of z.

To begin, we see from the points for which $p_{X,Y,Z}(x, y, z)$ is defined that Z has two possible values, 1 and 2. Suppose z = 1. Then

\[ p_{X,Y|1}(x, y) = \frac{p_{X,Y,Z}(x, y, 1)}{p_Z(1)} \]
But
\[ p_Z(1) = P(Z = 1) = P[(1, 1, 1) \cup (2, 2, 1)] = \frac{1 \cdot 1}{9 \cdot 1} + \frac{2 \cdot 2}{9 \cdot 1} = \frac{5}{9} \]
Therefore,

\[ p_{X,Y|1}(x, y) = \frac{xy/9}{5/9} = xy/5 \quad \text{for } (x, y) = (1, 1) \text{ and } (2, 2) \]
Suppose z = 2. Then

\[ p_Z(2) = P(Z = 2) = P[(2, 1, 2) \cup (1, 2, 2) \cup (2, 2, 2)] = \frac{2}{18} + \frac{2}{18} + \frac{4}{18} = \frac{8}{18} \]

so

\[ p_{X,Y|2}(x, y) = \frac{p_{X,Y,Z}(x, y, 2)}{p_Z(2)} = \frac{x \cdot y/18}{8/18} = \frac{xy}{8} \quad \text{for } (x, y) = (2, 1), (1, 2), \text{ and } (2, 2) \]


Questions


3.11.1. Suppose X and Y have the joint pdf $p_{X,Y}(x, y)$ = <=) for the points (1, 1), (1, 2), (2, 1), (2, 2), where X denotes a "message" sent (either x = 1 or x = 2) and Y denotes a "message" received. Find the probability that the message sent was the message received, that is, find $p_{Y|x}(y)$.


3.11.2. Suppose a die is rolled six times. Let X be the total number of 4's that occur and let Y be the number of 4's in the first two tosses. Find $p_{Y|x}(y)$.


3.11.3. An urn contains eight red chips, six white chips, and four blue chips. A sample of size 3 is drawn without replacement. Let X denote the number of red chips in the sample and Y, the number of white chips. Find an expression for $p_{Y|x}(y)$.


3.11.4. Five cards are dealt from a standard poker deck. Let X be the number of aces received, and Y the number of kings. Compute $P(X = 2 \mid Y = 2)$.


3.11.5. Given that two discrete random variables X and Y follow the joint pdf $p_{X,Y}(x, y) = k(x + y)$, for x = 1, 2, 3 and y = 1, 2, 3,

(a) Find k.
(b) Evaluate $p_{Y|x}(1)$ for all values of x for which $p_X(x) > 0$.


3.11.6. Let X denote the number on a chip drawn at random from an urn containing three chips, numbered 1, 2, and 3. Let Y be the number of heads that occur when a fair coin is tossed X times.


(a) Find py y(x, y). 
(b) Find the marginal pdf of Y by summing out the x 
values. 


3.11.7. Suppose X, Y, and Z have a trivariate distribution described by the joint pdf
\[ p_{X,Y,Z}(x, y, z) = \frac{xy + xz + yz}{54} \]
where x, y, and z can be 1 or 2. Tabulate the joint conditional pdf of X and Y given each of the two values of z.


3.11.8. In Question 3.11.7 define the random variable 
W to be the “majority” of x, y, and z. For example, 
W(2, 2, 1)=2 and W(1, 1, 1)=1. Find the pdf of W|x. 


3.11.9. Let X and Y be independent random variables where $p_X(k) = e^{-\lambda}\frac{\lambda^k}{k!}$ and $p_Y(k) = e^{-\mu}\frac{\mu^k}{k!}$ for $k = 0, 1, \ldots$. Show that the conditional pdf of X given that X + Y = n is binomial with parameters n and $\frac{\lambda}{\lambda + \mu}$. (Hint: See Question 3.8.1.)


3.11.10. Suppose Compositor A is preparing a manuscript to be published. Assume that she makes X errors on a given page, where X has the Poisson pdf, $p_X(k) = e^{-2}2^k/k!$, $k = 0, 1, 2, \ldots$. A second compositor, B, is also working on the book. He makes Y errors on a page, where $p_Y(k) = e^{-3}3^k/k!$, $k = 0, 1, 2, \ldots$. Assume that Compositor A prepares the first one hundred pages of the text and Compositor B, the last one hundred pages. After the book is completed, reviewers (with too much time on their hands!) find that the text contains a total of 520 errors. Write a formula for the exact probability that fewer than half of the errors are due to Compositor A.
Finding Conditional Pdfs for Continuous Random Variables


If the variables X and Y are continuous, we can still appeal to the quotient $f_{X,Y}(x, y)/f_X(x)$ as the definition of $f_{Y|x}(y)$ and argue its propriety by analogy. A more satisfying approach, though, is to arrive at the same conclusion by taking the limit of Y's "conditional" cdf.

If X is continuous, a direct evaluation of $F_{Y|x}(y) = P(Y \le y \mid X = x)$, via Definition 2.4.1, is impossible, since the denominator would be zero. Alternatively, we can think of $P(Y \le y \mid X = x)$ as a limit:

\[ P(Y \le y \mid X = x) = \lim_{h \to 0} P(Y \le y \mid x \le X \le x + h) = \lim_{h \to 0} \frac{\int_x^{x+h}\int_{-\infty}^{y} f_{X,Y}(t, u)\,du\,dt}{\int_x^{x+h} f_X(t)\,dt} \]

Evaluating the quotient of the limits gives $\frac{0}{0}$, so l'Hôpital's rule is indicated:

\[ P(Y \le y \mid X = x) = \lim_{h \to 0} \frac{\frac{d}{dh}\int_x^{x+h}\int_{-\infty}^{y} f_{X,Y}(t, u)\,du\,dt}{\frac{d}{dh}\int_x^{x+h} f_X(t)\,dt} \tag{3.11.1} \]

By the Fundamental Theorem of Calculus,

\[ \frac{d}{dh}\int_x^{x+h} g(t)\,dt = g(x + h) \]

which simplifies Equation 3.11.1 to

\[ \begin{aligned} P(Y \le y \mid X = x) &= \lim_{h \to 0} \frac{\int_{-\infty}^{y} f_{X,Y}(x + h, u)\,du}{f_X(x + h)} \\ &= \frac{\int_{-\infty}^{y} \lim_{h \to 0} f_{X,Y}(x + h, u)\,du}{\lim_{h \to 0} f_X(x + h)} = \int_{-\infty}^{y} \frac{f_{X,Y}(x, u)}{f_X(x)}\,du \end{aligned} \]

provided that the limit operation and the integration can be interchanged [see (8) for a discussion of when such an interchange is valid]. It follows from this last expression that $f_{X,Y}(x, y)/f_X(x)$ behaves as a conditional probability density function should, and we are justified in extending Definition 3.11.1 to the continuous case.


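Example 3.11.5

Let X and Y be continuous random variables with joint pdf

\[ f_{X,Y}(x, y) = \begin{cases} \left(\frac{1}{8}\right)(6 - x - y), & 0 \le x \le 2, \; 2 \le y \le 4 \\ 0, & \text{elsewhere} \end{cases} \]

Find (a) $f_X(x)$, (b) $f_{Y|x}(y)$, and (c) $P(2 < Y < 3 \mid x = 1)$.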
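a. From Theorem 3.7.2,

\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \int_2^4 \left(\frac{1}{8}\right)(6 - x - y)\,dy = \left(\frac{1}{8}\right)(6 - 2x), \quad 0 \le x \le 2 \]

b. Substituting into the "continuous" statement of Definition 3.11.1, we can write

\[ f_{Y|x}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{\left(\frac{1}{8}\right)(6 - x - y)}{\left(\frac{1}{8}\right)(6 - 2x)} = \frac{6 - x - y}{6 - 2x}, \quad 0 \le x \le 2, \; 2 \le y \le 4 \]

c. To find $P(2 < Y < 3 \mid x = 1)$, we simply integrate $f_{Y|1}(y)$ over the interval $2 < Y < 3$:

\[ P(2 < Y < 3 \mid x = 1) = \int_2^3 f_{Y|1}(y)\,dy = \int_2^3 \frac{5 - y}{4}\,dy = \frac{5}{8} \]

[A partial check that the derivation of a conditional pdf is correct can be performed by integrating $f_{Y|x}(y)$ over the entire range of Y. That integral should be 1. Here, for example, when x = 1, $\int_{-\infty}^{\infty} f_{Y|1}(y)\,dy = \int_2^4 [(5 - y)/4]\,dy$ does equal 1.]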
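Both the answer to part (c) and the partial check can be reproduced exactly, because the conditional pdf $(6 - x - y)/(6 - 2x)$ has a simple antiderivative in y. The sketch below is an illustration only; `conditional_prob` is a hypothetical helper, and it assumes $0 \le x \le 2$ and $2 \le a \le b \le 4$.

```python
from fractions import Fraction


def conditional_prob(x, a, b):
    """P(a < Y < b | X = x) for f_{Y|x}(y) = (6 - x - y)/(6 - 2x).

    Assumes 0 <= x <= 2 and 2 <= a <= b <= 4, as in the example.
    """
    x, a, b = Fraction(x), Fraction(a), Fraction(b)

    def antiderivative(y):
        # integral of (6 - x - y) with respect to y
        return (6 - x) * y - y * y / 2

    return (antiderivative(b) - antiderivative(a)) / (6 - 2 * x)


print(conditional_prob(1, 2, 3))  # part (c)
print(conditional_prob(1, 2, 4))  # total area under f_{Y|1}(y)
```

The second call integrates over the whole range of Y and therefore serves as the same partial check described in the bracketed note.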
Questions

3.11.11. Let X be a nonnegative random variable. We say that X is memoryless if

\[ P(X > s + t \mid X > t) = P(X > s) \quad \text{for all } s, t \ge 0 \]

Show that a random variable with pdf $f_X(x) = (1/\lambda)e^{-x/\lambda}$, $x > 0$, is memoryless.


3.11.12. Given the joint pdf
\[ f_{X,Y}(x, y) = 2e^{-(x+y)}, \quad 0 \le x \le y, \; y \ge 0 \]
find

(a) $P(Y < 1 \mid X < 1)$.
(b) $P(Y < 1 \mid X = 1)$.
(c) $f_{Y|x}(y)$.
(d) $E(Y \mid x)$.


3.11.13. Find the conditional pdf of Y given x if

\[ f_{X,Y}(x, y) = x + y \]

for $0 \le x \le 1$ and $0 \le y \le 1$.


3.11.14. If
\[ f_{X,Y}(x, y) = 2, \quad x \ge 0, \; y \ge 0, \; x + y \le 1 \]
show that the conditional pdf of Y given x is uniform.


3.11.15. Suppose that

\[ f_{Y|x}(y) = \frac{2y + 4x}{1 + 4x} \quad \text{and} \quad f_X(x) = \frac{1}{3} \cdot (1 + 4x) \]

for $0 < x < 1$ and $0 < y < 1$. Find the marginal pdf for Y.


3.11.16. Suppose that X and Y are distributed according to the joint pdf

\[ f_{X,Y}(x, y) = \frac{2}{5} \cdot (2x + 3y), \quad 0 \le x \le 1, \; 0 \le y \le 1 \]


Find 


(a) fx(x). 

(b) fr). 
(c) P(4<Y 
(d) E(Y|x). 


IA 
Rie 
Be 
ll 
Le 
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3.11.17. If X and Y have the joint pdf for 0<x; <1,i=1,2,...,5. Find the joint conditional pdf 


fxy(X, y) =2, 


find P(0< X <$|Y=3). 


3.11.18. Find P(X < 1|Y = 1%) if X and Y have the 


joint Pdf 


fx, y) =xy/2, 


of X,, X2, and X; given that X,=x, and X5= xs. 
O<x<y<l 


3.11.20. Suppose the random variables X and Y are 
jointly distributed according to the Pdf 


6 
frre W=a(e+ >), O<x<l, 0<y<2 


O<x<y<2 Find 


3.11.19. Suppose that X,, X2, X3, X4, and X; have the (a) fy(x). 


joint pdf 


Tx) .X),X3,X4,Xs (X1, X25 X3, X4, X5) = 32K XyX3X4X5 


Example 
3.12.1 


(b) P(X >2Y). 
(c) P(Y > 1|X > 3). 


3.12 Moment-Generating Functions 


Finding moments of random variables directly, particularly the higher moments defined in Section 3.6, is conceptually straightforward but can be quite problematic: Depending on the nature of the pdf, integrals and sums of the form $\int_{-\infty}^{\infty} y^r f_Y(y)\,dy$ and $\sum_{\text{all } k} k^r p_X(k)$ can be very difficult to evaluate. Fortunately, an alternative method is available. For many pdfs, we can find a moment-generating function (or mgf), $M_W(t)$, one of whose properties is that the rth derivative of $M_W(t)$ evaluated at zero is equal to $E(W^r)$.


Calculating a Random Variable’s Moment-Generating Function 


In principle, what we call a moment-generating function is a direct application of 
Theorem 3.5.3. 


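Definition 3.12.1. Let W be a random variable. The moment-generating function (mgf) for W is denoted $M_W(t)$ and given by

$$M_W(t) = E(e^{tW}) = \begin{cases} \displaystyle\sum_{\text{all } k} e^{tk} p_W(k) & \text{if } W \text{ is discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} e^{tw} f_W(w)\,dw & \text{if } W \text{ is continuous} \end{cases}$$

at all values of t for which the expected value exists.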


Example 3.12.1
Suppose the random variable X has a geometric pdf,

$$p_X(k) = (1 - p)^{k-1}p, \quad k = 1, 2, \ldots$$

[In practice, this is the pdf that models the occurrence of the first success in a series of independent trials, where each trial has a probability p of ending in success (recall Example 3.3.2).] Find $M_X(t)$, the moment-generating function for X.

Since X is discrete, the first part of Definition 3.12.1 applies, so

$$M_X(t) = E(e^{tX}) = \sum_{k=1}^{\infty} e^{tk}(1 - p)^{k-1}p = \frac{p}{1-p}\sum_{k=1}^{\infty} e^{tk}(1 - p)^k = \frac{p}{1-p}\sum_{k=1}^{\infty} [(1 - p)e^t]^k \tag{3.12.1}$$

The t in $M_X(t)$ can be any number in a neighborhood of zero, as long as $M_X(t) < \infty$. Here, $M_X(t)$ is an infinite sum of the terms $[(1 - p)e^t]^k$, and that sum will be finite only if $(1 - p)e^t < 1$, or, equivalently, if $t < \ln[1/(1 - p)]$. It will be assumed, then, in what follows that $0 < t < \ln[1/(1 - p)]$.

Recall that

$$\sum_{k=0}^{\infty} r^k = \frac{1}{1 - r}$$

provided $0 < r < 1$. This formula can be used on Equation 3.12.1, where $r = (1 - p)e^t$ and $0 < t < \ln\left[\frac{1}{1-p}\right]$. Specifically,

$$M_X(t) = \frac{p}{1-p}\left(\sum_{k=0}^{\infty} [(1 - p)e^t]^k - [(1 - p)e^t]^0\right) = \frac{p}{1-p}\left[\frac{1}{1 - (1 - p)e^t} - 1\right] = \frac{pe^t}{1 - (1 - p)e^t}$$
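A quick numerical comparison makes the closed form for the geometric mgf easier to trust. The following Python sketch is an illustration added here (the function names and the truncation at 500 terms are assumptions, not part of the text): it sums the defining series directly and compares the result with $pe^t/[1 - (1 - p)e^t]$.

```python
import math


def geometric_mgf_series(t, p, terms=500):
    """Partial sum of E(e^{tX}) = sum_{k>=1} e^{tk} (1-p)^(k-1) p."""
    return sum(
        math.exp(t * k) * (1 - p) ** (k - 1) * p
        for k in range(1, terms + 1)
    )


def geometric_mgf_closed(t, p):
    """Closed form p e^t / (1 - (1-p) e^t); requires (1-p) e^t < 1."""
    r = (1 - p) * math.exp(t)
    if r >= 1:
        raise ValueError("mgf does not exist: need t < ln(1/(1-p))")
    return p * math.exp(t) / (1 - r)


# p = 0.3 gives the bound ln(1/0.7), roughly 0.357, so t = 0.2 is admissible.
print(geometric_mgf_series(0.2, 0.3), geometric_mgf_closed(0.2, 0.3))
```

Choosing t at or above $\ln[1/(1 - p)]$ makes the series diverge, which is why the closed-form helper refuses such inputs.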
Example 3.12.2
Suppose that X is a binomial random variable with pdf

$$p_X(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, \ldots, n$$

Find $M_X(t)$.
By Definition 3.12.1,

$$M_X(t) = E(e^{tX}) = \sum_{k=0}^{n} e^{tk}\binom{n}{k} p^k (1 - p)^{n-k} = \sum_{k=0}^{n} \binom{n}{k} (pe^t)^k (1 - p)^{n-k} \tag{3.12.2}$$

To get a closed-form expression for $M_X(t)$, that is, to evaluate the sum indicated in Equation 3.12.2, requires a (hopefully) familiar formula from algebra: According to Newton’s binomial expansion,

$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k} \tag{3.12.3}$$

for any x and y. Suppose we let $x = pe^t$ and $y = 1 - p$. It follows from Equations 3.12.2 and 3.12.3, then, that

$$M_X(t) = (1 - p + pe^t)^n$$

[Notice in this case that $M_X(t)$ is defined for all values of t.]
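Because the binomial sum in Equation 3.12.2 is finite, the closed form can be checked term by term on a computer. The sketch below is an added illustration with hypothetical function names; it evaluates the defining sum and compares it with $(1 - p + pe^t)^n$ for several values of t, including negative ones.

```python
import math


def binomial_mgf_sum(t, n, p):
    """E(e^{tX}) computed directly from the binomial pdf (Equation 3.12.2)."""
    return sum(
        math.exp(t * k) * math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
    )


def binomial_mgf_closed(t, n, p):
    """Closed form (1 - p + p e^t)^n obtained from the binomial expansion."""
    return (1 - p + p * math.exp(t)) ** n


for t in (-1.0, 0.0, 0.5, 2.0):
    print(t, binomial_mgf_sum(t, 10, 0.3), binomial_mgf_closed(t, 10, 0.3))
```

The agreement at every t reflects the remark that this mgf is defined for all values of t.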
Example 3.12.3
Suppose that Y has an exponential pdf, where $f_Y(y) = \lambda e^{-\lambda y}$, $y > 0$. Find $M_Y(t)$.

Since the exponential pdf describes a continuous random variable, $M_Y(t)$ is an integral:

$$M_Y(t) = E(e^{tY}) = \int_0^{\infty} e^{ty} \cdot \lambda e^{-\lambda y}\,dy = \int_0^{\infty} \lambda e^{-(\lambda - t)y}\,dy$$

After making the substitution $u = (\lambda - t)y$, we can write

$$M_Y(t) = \frac{\lambda}{\lambda - t}\int_0^{\infty} e^{-u}\,du = \frac{\lambda}{\lambda - t}$$

Here, $M_Y(t)$ is finite and nonzero only when $u = (\lambda - t)y > 0$, which implies that t must be less than $\lambda$. For $t \ge \lambda$, $M_Y(t)$ fails to exist.


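Example 3.12.4
The normal (or bell-shaped) curve was introduced in Example 3.4.3. Its pdf is the rather cumbersome function

$$f_Y(y) = \left(1/\sqrt{2\pi}\,\sigma\right) \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right], \quad -\infty < y < \infty$$

where $\mu = E(Y)$ and $\sigma^2 = \text{Var}(Y)$. Derive the moment-generating function for this most important of all probability models.

Since Y is a continuous random variable,

$$M_Y(t) = E(e^{tY}) = \left(1/\sqrt{2\pi}\,\sigma\right) \int_{-\infty}^{\infty} \exp(ty) \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right] dy = \left(1/\sqrt{2\pi}\,\sigma\right) \int_{-\infty}^{\infty} \exp\left[-\frac{y^2 - 2\mu y - 2\sigma^2 t y + \mu^2}{2\sigma^2}\right] dy \tag{3.12.4}$$

Evaluating the integral in Equation 3.12.4 is best accomplished by completing the square of the numerator of the exponent (which means that the square of half the coefficient of y is added and subtracted). That is, we can write

$$y^2 - (2\mu + 2\sigma^2 t)y + (\mu + \sigma^2 t)^2 - (\mu + \sigma^2 t)^2 + \mu^2 = [y - (\mu + \sigma^2 t)]^2 - \sigma^4 t^2 - 2\mu t \sigma^2 \tag{3.12.5}$$

The last two terms on the right-hand side of Equation 3.12.5, though, do not involve y, so they can be factored out of the integral, and Equation 3.12.4 reduces to

$$M_Y(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right) \left(1/\sqrt{2\pi}\,\sigma\right) \int_{-\infty}^{\infty} \exp\left[-\frac{1}{2}\left(\frac{y - (\mu + t\sigma^2)}{\sigma}\right)^2\right] dy$$

But, together, the latter two factors equal 1 (why?), implying that the moment-generating function for a normally distributed random variable is given by

$$M_Y(t) = e^{\mu t + \sigma^2 t^2/2}$$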
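The completing-the-square argument can be sanity-checked by integrating $e^{ty} f_Y(y)$ numerically. The sketch below is an added illustration, not the authors' method: the function names, the midpoint rule, and the choice to integrate over 12 standard deviations on either side of $\mu + \sigma^2 t$ (where the integrand peaks) are all assumptions made here.

```python
import math


def normal_pdf(y, mu, sigma):
    """Normal pdf with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)


def normal_mgf_numeric(t, mu, sigma, half_width=12.0, steps=20_000):
    """Midpoint-rule estimate of E(e^{tY}) for a normal random variable."""
    center = mu + sigma**2 * t          # location of the integrand's peak
    a = center - half_width * sigma     # left end of the integration window
    h = 2 * half_width * sigma / steps  # width of each subinterval
    total = 0.0
    for i in range(steps):
        y = a + (i + 0.5) * h
        total += math.exp(t * y) * normal_pdf(y, mu, sigma)
    return h * total


def normal_mgf_closed(t, mu, sigma):
    """Closed form exp(mu t + sigma^2 t^2 / 2) from Example 3.12.4."""
    return math.exp(mu * t + 0.5 * sigma**2 * t**2)


print(normal_mgf_numeric(0.5, 1.0, 2.0), normal_mgf_closed(0.5, 1.0, 2.0))
```

With $\mu = 1$, $\sigma = 2$, and $t = 0.5$ the exponent $\mu t + \sigma^2 t^2/2$ equals 1, so both numbers should be close to e.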
Questions

3.12.1. Let X be a random variable with pdf $p_X(k) = 1/n$, for $k = 0, 1, 2, \ldots, n - 1$ and 0 otherwise. Show that $M_X(t) = \dfrac{1 - e^{nt}}{n(1 - e^t)}$.

3.12.2. Two chips are drawn at random and without 
replacement from an urn that contains five chips, num- 
bered 1 through 5. If the sum of the chips drawn is 
even, the random variable X equals 5; if the sum of the 
chips drawn is odd, X = −3. Find the moment-generating
function for X. 


3.12.3. Find the expected value of $e^{3X}$ if X is a binomial random variable with n = 10 and $p = \frac{1}{3}$.


3.12.4. Find the moment-generating function for the dis- 
crete random variable X whose probability function is 


given by 
ag Al 
kKy=(-— —), k=0,1,2,... 
r0e(2)Q) 


3.12.5. Which pdfs would have the following moment-generating functions?

(a) $M_Y(t) = e^{6t^2}$
(b) $M_Y(t) = 2/(2 - t)$
(c) $M_X(t) = \left(\frac{1}{2} + \frac{1}{2}e^t\right)^4$
(d) $M_X(t) = 0.3e^t/(1 - 0.7e^t)$


3.12.6. Let Y have pdf

$$f_Y(y) = \begin{cases} y, & 0 \le y \le 1 \\ 2 - y, & 1 \le y \le 2 \\ 0, & \text{elsewhere} \end{cases}$$

Find $M_Y(t)$.

3.12.7. A random variable X is said to have a Poisson distribution if $p_X(k) = P(X = k) = e^{-\lambda}\lambda^k/k!$, $k = 0, 1, 2, \ldots$. Find the moment-generating function for a Poisson random variable. Recall that

$$e^r = \sum_{k=0}^{\infty} \frac{r^k}{k!}$$

3.12.8. Let Y be a continuous random variable with $f_Y(y) = ye^{-y}$, $0 \le y$. Show that $M_Y(t) = \dfrac{1}{(1 - t)^2}$.


Using Moment-Generating Functions to Find Moments 


Having practiced finding the functions $M_X(t)$ and $M_Y(t)$, we now turn to the theorem that spells out their relationship to $X^r$ and $Y^r$.


Theorem 3.12.1
Let W be a random variable with probability density function $f_W(w)$. [If W is continuous, $f_W(w)$ must be sufficiently smooth to allow the order of differentiation and integration to be interchanged.] Let $M_W(t)$ be the moment-generating function for W. Then, provided the rth moment exists,

$$M_W^{(r)}(0) = E(W^r)$$

Proof We will verify the theorem for the continuous case where r is either 1 or 2. The extensions to discrete random variables and to an arbitrary positive integer r are straightforward.

For r = 1,

$$M_Y^{(1)}(0) = \frac{d}{dt}\int_{-\infty}^{\infty} e^{ty} f_Y(y)\,dy\,\bigg|_{t=0} = \int_{-\infty}^{\infty} \frac{d}{dt}\, e^{ty} f_Y(y)\,dy\,\bigg|_{t=0} = \int_{-\infty}^{\infty} y e^{ty} f_Y(y)\,dy\,\bigg|_{t=0} = \int_{-\infty}^{\infty} y e^{0 \cdot y} f_Y(y)\,dy = \int_{-\infty}^{\infty} y f_Y(y)\,dy = E(Y)$$

For r = 2,

$$M_Y^{(2)}(0) = \frac{d^2}{dt^2}\int_{-\infty}^{\infty} e^{ty} f_Y(y)\,dy\,\bigg|_{t=0} = \int_{-\infty}^{\infty} \frac{d^2}{dt^2}\, e^{ty} f_Y(y)\,dy\,\bigg|_{t=0} = \int_{-\infty}^{\infty} y^2 e^{ty} f_Y(y)\,dy\,\bigg|_{t=0} = \int_{-\infty}^{\infty} y^2 e^{0 \cdot y} f_Y(y)\,dy = \int_{-\infty}^{\infty} y^2 f_Y(y)\,dy = E(Y^2)$$
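Theorem 3.12.1 can be illustrated numerically by approximating the derivatives of an mgf at t = 0 with finite differences. The sketch below is an added example under stated assumptions: the helper name and step size h are arbitrary choices made here, and the exponential mgf $\lambda/(\lambda - t)$ from Example 3.12.3 is used with $\lambda = 2$, for which the standard results $E(Y) = 1/\lambda$ and $E(Y^2) = 2/\lambda^2$ both equal 0.5.

```python
def derivative_at_zero(mgf, order, h=1e-3):
    """Central finite-difference estimate of the 1st or 2nd derivative at t = 0."""
    if order == 1:
        return (mgf(h) - mgf(-h)) / (2 * h)
    if order == 2:
        return (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2
    raise ValueError("only orders 1 and 2 are sketched here")


lam = 2.0


def exp_mgf(t):
    """mgf of an exponential random variable with parameter lam (t < lam)."""
    return lam / (lam - t)


first_moment = derivative_at_zero(exp_mgf, 1)   # approximates E(Y)   = 1/lam
second_moment = derivative_at_zero(exp_mgf, 2)  # approximates E(Y^2) = 2/lam^2
print(first_moment, second_moment)
```

Finite differences only approximate the derivatives, so the printed values match the exact moments to several decimal places rather than exactly.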


Example 3.12.5
For a geometric random variable X with pdf

$$p_X(k) = (1 - p)^{k-1}p, \quad k = 1, 2, \ldots$$

we saw in Example 3.12.1 that

$$M_X(t) = pe^t[1 - (1 - p)e^t]^{-1}$$

Find the expected value of X by differentiating its moment-generating function.

Using the product rule, we can write the first derivative of $M_X(t)$ as

$$M_X^{(1)}(t) = pe^t(-1)[1 - (1 - p)e^t]^{-2}(-1)(1 - p)e^t + [1 - (1 - p)e^t]^{-1}pe^t = \frac{p(1 - p)e^{2t}}{[1 - (1 - p)e^t]^2} + \frac{pe^t}{1 - (1 - p)e^t}$$

Setting t = 0 shows that $E(X) = \frac{1}{p}$:

$$M_X^{(1)}(0) = E(X) = \frac{p(1 - p)e^{2 \cdot 0}}{[1 - (1 - p)e^0]^2} + \frac{pe^0}{1 - (1 - p)e^0} = \frac{p(1 - p)}{p^2} + \frac{p}{p} = \frac{1}{p}$$
Example 3.12.6
Find the expected value of an exponential random variable with pdf

$$f_Y(y) = \lambda e^{-\lambda y}, \quad y > 0$$

Use the fact that

$$M_Y(t) = \lambda(\lambda - t)^{-1}$$

(as shown in Example 3.12.3).

Differentiating $M_Y(t)$ gives

$$M_Y^{(1)}(t) = \lambda(-1)(\lambda - t)^{-2}(-1) = \frac{\lambda}{(\lambda - t)^2}$$

Set t = 0. Then

$$M_Y^{(1)}(0) = \frac{\lambda}{\lambda^2}$$

implying that

$$E(Y) = \frac{1}{\lambda}$$

Example 3.12.7
Find an expression for $E(X^k)$ if the moment-generating function for X is given by

$$M_X(t) = (1 - p_1 - p_2) + p_1e^t + p_2e^{2t}$$

The only way to deduce a formula for an arbitrary moment such as $E(X^k)$ is to calculate the first couple of moments and look for a pattern that can be generalized. Here,

$$M_X^{(1)}(t) = p_1e^t + 2p_2e^{2t}$$

so

$$E(X) = M_X^{(1)}(0) = p_1e^0 + 2p_2e^{2 \cdot 0} = p_1 + 2p_2$$

Taking the second derivative, we see that

$$M_X^{(2)}(t) = p_1e^t + 2^2p_2e^{2t}$$

implying that

$$E(X^2) = M_X^{(2)}(0) = p_1e^0 + 2^2p_2e^{2 \cdot 0} = p_1 + 2^2p_2$$

Clearly, each successive differentiation will leave the $p_1$ term unaffected but will multiply the $p_2$ term by 2. Therefore,

$$E(X^k) = M_X^{(k)}(0) = p_1 + 2^kp_2$$
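The pattern found in Example 3.12.7 can be confirmed without any calculus. The mgf $(1 - p_1 - p_2) + p_1e^t + p_2e^{2t}$ has the form $\sum e^{tk}p_X(k)$ for a random variable taking the values 0, 1, and 2 with probabilities $1 - p_1 - p_2$, $p_1$, and $p_2$. The sketch below (an added illustration; the function names and the particular probabilities are hypothetical) computes $E(X^k)$ directly from that pdf and compares it with $p_1 + 2^kp_2$.

```python
import math


def moment_from_pdf(k, p1, p2):
    """E(X^k) computed directly for X in {0, 1, 2} (k must be at least 1)."""
    pdf = {0: 1 - p1 - p2, 1: p1, 2: p2}
    return sum(x**k * prob for x, prob in pdf.items())


def moment_from_formula(k, p1, p2):
    """The closed-form pattern p1 + 2^k p2 deduced from the mgf."""
    return p1 + 2**k * p2


for k in range(1, 6):
    direct = moment_from_pdf(k, 0.2, 0.5)
    formula = moment_from_formula(k, 0.2, 0.5)
    print(k, direct, formula, math.isclose(direct, formula))
```

Reading the pdf off the mgf in this way is the same idea that the uniqueness property of moment-generating functions makes rigorous.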


Using Moment-Generating Functions to Find Variances 


In addition to providing a useful technique for calculating $E(W^r)$, moment-generating functions can also find variances, because

$$\text{Var}(W) = E(W^2) - [E(W)]^2 \tag{3.12.6}$$

for any random variable W (recall Theorem 3.6.1). Other useful “descriptors” of pdfs can also be reduced to combinations of moments. The skewness of a distribution, for example, is a function of $E[(W - \mu)^3]$, where $\mu = E(W)$. But

$$E[(W - \mu)^3] = E(W^3) - 3E(W^2)E(W) + 2[E(W)]^3$$

In many cases, finding $E[(W - \mu)^2]$ or $E[(W - \mu)^3]$ could be quite difficult if moment-generating functions were not available.


Example 3.12.8
We know from Example 3.12.2 that if X is a binomial random variable with parameters n and p, then

$$M_X(t) = (1 - p + pe^t)^n$$

Use $M_X(t)$ to find the variance of X.

The first two derivatives of $M_X(t)$ are

$$M_X^{(1)}(t) = n(1 - p + pe^t)^{n-1} \cdot pe^t$$

and

$$M_X^{(2)}(t) = pe^t \cdot n(n - 1)(1 - p + pe^t)^{n-2} \cdot pe^t + n(1 - p + pe^t)^{n-1} \cdot pe^t$$

Setting t = 0 gives

$$M_X^{(1)}(0) = np = E(X)$$

and

$$M_X^{(2)}(0) = n(n - 1)p^2 + np = E(X^2)$$

From Equation 3.12.6, then,

$$\text{Var}(X) = n(n - 1)p^2 + np - (np)^2 = np(1 - p)$$

(the same answer we found in Example 3.9.9).
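The moments obtained from the mgf can be compared with moments computed directly from the binomial pdf. The sketch below is an added check, with a hypothetical function name and arbitrarily chosen n and p; it uses exact rational arithmetic from Python's standard `fractions` module so that the comparison involves no rounding.

```python
from fractions import Fraction
from math import comb


def binomial_moments(n, p):
    """Return (E(X), E(X^2)) computed exactly from the binomial pdf."""
    p = Fraction(p)
    pdf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    ex = sum(k * w for k, w in enumerate(pdf))
    ex2 = sum(k * k * w for k, w in enumerate(pdf))
    return ex, ex2


n, p = 10, Fraction(3, 10)
ex, ex2 = binomial_moments(n, p)

print(ex == n * p)                             # E(X)   = np
print(ex2 == n * (n - 1) * p**2 + n * p)       # E(X^2) = n(n-1)p^2 + np
print(ex2 - ex**2 == n * p * (1 - p))          # Var(X) = np(1-p)
```

All three comparisons should print True, matching the values of $M_X^{(1)}(0)$, $M_X^{(2)}(0)$, and Var(X) derived in the example.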


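Example 3.12.9
A discrete random variable X is said to have a Poisson distribution if

$$p_X(k) = P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$$

(An example of such a distribution is the mortality data described in Case Study 3.3.1.) It can be shown (see Question 3.12.7) that the moment-generating function for a Poisson random variable is given by

$$M_X(t) = e^{-\lambda + \lambda e^t}$$

Use $M_X(t)$ to find E(X) and Var(X).

Taking the first derivative of $M_X(t)$ gives

$$M_X^{(1)}(t) = e^{-\lambda + \lambda e^t} \cdot \lambda e^t$$

so

$$E(X) = M_X^{(1)}(0) = e^{-\lambda + \lambda e^0} \cdot \lambda e^0 = \lambda$$

Applying the product rule to $M_X^{(1)}(t)$ yields the second derivative,

$$M_X^{(2)}(t) = e^{-\lambda + \lambda e^t} \cdot \lambda e^t \cdot \lambda e^t + e^{-\lambda + \lambda e^t} \cdot \lambda e^t$$

For t = 0,

$$M_X^{(2)}(0) = E(X^2) = e^{-\lambda + \lambda e^0} \cdot \lambda e^0 \cdot \lambda e^0 + e^{-\lambda + \lambda e^0} \cdot \lambda e^0 = \lambda^2 + \lambda$$

The variance of a Poisson random variable, then, proves to be the same as its mean:

$$\text{Var}(X) = E(X^2) - [E(X)]^2 = M_X^{(2)}(0) - \left[M_X^{(1)}(0)\right]^2 = \lambda^2 + \lambda - \lambda^2 = \lambda$$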
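That the Poisson mean and variance coincide can be seen numerically by summing the pdf directly. The sketch below is an added illustration; the function names, the value $\lambda = 3.5$, and the truncation of the infinite sums at 100 terms are assumptions made here (the neglected tail is negligible for moderate $\lambda$).

```python
import math


def poisson_pdf(k, lam):
    """P(X = k) = e^{-lam} lam^k / k! for a Poisson random variable."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def poisson_mean_var(lam, terms=100):
    """Approximate (E(X), Var(X)) by truncating the defining sums."""
    ex = sum(k * poisson_pdf(k, lam) for k in range(terms))
    ex2 = sum(k * k * poisson_pdf(k, lam) for k in range(terms))
    return ex, ex2 - ex**2


mean, var = poisson_mean_var(3.5)
print(mean, var)  # both should be close to lam = 3.5
```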


Questions 


3.12.9. Calculate E(Y*) for a random variable whose 
moment-generating function is My (t) =e". 


3.12.10. Find E(Y*) if Y is an exponential random vari- 
able with f;(y) =Ae’, y > 0. 


3.12.11. The form of the moment-generating function for a normal random variable is $M_Y(t) = e^{at + b^2t^2/2}$ (recall Example 3.12.4). Differentiate $M_Y(t)$ to verify that $a = E(Y)$ and $b^2 = \text{Var}(Y)$.

3.12.12. What is E(Y*) if the random variable Y has 
moment-generating function My (t) = (1—at)*? 


3.12.13. Find E(Y’) if the moment-generating function 
for Y is given by My(t) = e+" Use Example 3.12.4 to 
find E(Y’) without taking any derivatives. (Hint: Recall 
Theorem 3.6.1.) 


3.12.14. Find an expression for $E(Y^k)$ if $M_Y(t) = (1 - t/\lambda)^{-r}$, where $\lambda$ is any positive real number and r is a positive integer.


3.12.15. Use $M_Y(t)$ to find the expected value of the uniform random variable described in Question 3.12.1.


3.12.16. Find the variance of Y if My(t)=e”/(1 —?°). 


Using Moment-Generating Functions to Identify Pdfs 


Finding moments is not the only application of moment-generating functions. They are also used to identify the pdf of sums of random variables, that is, finding $f_W(w)$, where $W = W_1 + W_2 + \cdots + W_n$. Their assistance in the latter is particularly important for two reasons: (1) Many statistical procedures are defined in terms of sums, and (2) alternative methods for deriving $f_{W_1 + W_2 + \cdots + W_n}(w)$ are extremely cumbersome.

The next two theorems give the background results necessary for deriving $f_W(w)$. Theorem 3.12.2 states a key uniqueness property of moment-generating functions: If $W_1$ and $W_2$ are random variables with the same mgfs, they must necessarily have the same pdfs. In practice, applications of Theorem 3.12.2 typically rely on one or both of the algebraic properties cited in Theorem 3.12.3.


Theorem 3.12.2
Suppose that $W_1$ and $W_2$ are random variables for which $M_{W_1}(t) = M_{W_2}(t)$ for some interval of t’s containing 0. Then $f_{W_1}(w) = f_{W_2}(w)$.

Proof See (95).

Theorem 3.12.3
a. Let W be a random variable with moment-generating function $M_W(t)$. Let $V = aW + b$. Then

$$M_V(t) = e^{bt}M_W(at)$$

b. Let $W_1, W_2, \ldots, W_n$ be independent random variables with moment-generating functions $M_{W_1}(t), M_{W_2}(t), \ldots,$ and $M_{W_n}(t)$, respectively. Let $W = W_1 + W_2 + \cdots + W_n$. Then

$$M_W(t) = M_{W_1}(t) \cdot M_{W_2}(t) \cdots M_{W_n}(t)$$

Proof The proof is left as an exercise.

Example 3.12.10
Suppose that $X_1$ and $X_2$ are two independent Poisson random variables with parameters $\lambda_1$ and $\lambda_2$, respectively. That is,

$$p_{X_1}(k) = P(X_1 = k) = \frac{e^{-\lambda_1}\lambda_1^k}{k!}, \quad k = 0, 1, 2, \ldots$$

and

$$p_{X_2}(k) = P(X_2 = k) = \frac{e^{-\lambda_2}\lambda_2^k}{k!}, \quad k = 0, 1, 2, \ldots$$

Let $X = X_1 + X_2$. What is the pdf for X?

According to Example 3.12.9, the moment-generating functions for $X_1$ and $X_2$ are

$$M_{X_1}(t) = e^{-\lambda_1 + \lambda_1e^t}$$

and

$$M_{X_2}(t) = e^{-\lambda_2 + \lambda_2e^t}$$

Moreover, if $X = X_1 + X_2$, then by part (b) of Theorem 3.12.3,

$$M_X(t) = M_{X_1}(t) \cdot M_{X_2}(t) = e^{-\lambda_1 + \lambda_1e^t} \cdot e^{-\lambda_2 + \lambda_2e^t} = e^{-(\lambda_1 + \lambda_2) + (\lambda_1 + \lambda_2)e^t} \tag{3.12.7}$$

But, by inspection, Equation 3.12.7 is the moment-generating function that a Poisson random variable with $\lambda = \lambda_1 + \lambda_2$ would have. It follows, then, by Theorem 3.12.2 that

$$p_X(k) = \frac{e^{-(\lambda_1 + \lambda_2)}(\lambda_1 + \lambda_2)^k}{k!}, \quad k = 0, 1, 2, \ldots$$

Comment The Poisson random variable reproduces itself in the sense that the sum of independent Poissons is also a Poisson. A similar property holds for independent normal random variables (see Question 3.12.19) and, under certain conditions, for independent binomial random variables (recall Example 3.8.2).
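The conclusion of Example 3.12.10 can be cross-checked by the "cumbersome" direct route that the mgf argument avoids: convolving the two Poisson pdfs. The sketch below is an added illustration with hypothetical function names and arbitrarily chosen parameters $\lambda_1 = 1.5$ and $\lambda_2 = 2.5$.

```python
import math


def poisson_pdf(k, lam):
    """P(X = k) for a Poisson random variable with parameter lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def pdf_of_sum(k, lam1, lam2):
    """P(X1 + X2 = k) for independent Poissons, by direct convolution."""
    return sum(
        poisson_pdf(j, lam1) * poisson_pdf(k - j, lam2)
        for j in range(k + 1)
    )


for k in range(6):
    # The two columns should agree: the sum is Poisson with lam1 + lam2 = 4.
    print(k, pdf_of_sum(k, 1.5, 2.5), poisson_pdf(k, 4.0))
```

The convolution needs a separate sum for every k, whereas the mgf argument settles all k at once by multiplying two exponentials.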


Example 3.12.11
We saw in Example 3.12.4 that a normal random variable, Y, with mean $\mu$ and variance $\sigma^2$ has pdf

$$f_Y(y) = \left(1/\sqrt{2\pi}\,\sigma\right) \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right], \quad -\infty < y < \infty$$

and mgf

$$M_Y(t) = e^{\mu t + \sigma^2t^2/2}$$

By definition, a standard normal random variable is a normal random variable for which $\mu = 0$ and $\sigma = 1$. Denoted Z, the pdf and mgf for a standard normal random variable are $f_Z(z) = (1/\sqrt{2\pi})e^{-z^2/2}$, $-\infty < z < \infty$, and $M_Z(t) = e^{t^2/2}$, respectively. Show that the ratio

$$\frac{Y - \mu}{\sigma}$$

is a standard normal random variable, Z.

Write $\frac{Y - \mu}{\sigma}$ as $\frac{1}{\sigma}Y - \frac{\mu}{\sigma}$. By part (a) of Theorem 3.12.3,

$$M_{(Y - \mu)/\sigma}(t) = e^{-\mu t/\sigma}M_Y\left(\frac{t}{\sigma}\right) = e^{-\mu t/\sigma}\,e^{[\mu t/\sigma + \sigma^2(t/\sigma)^2/2]} = e^{t^2/2}$$

But $M_Z(t) = e^{t^2/2}$, so it follows from Theorem 3.12.2 that the pdf for $\frac{Y - \mu}{\sigma}$ is the same as the pdf of Z, $f_Z(z)$. (We call $\frac{Y - \mu}{\sigma}$ a Z transformation. Its importance will become evident in Chapter 4.)


Questions 


3.12.17. Use Theorem 3.12.3(a) and Question 3.12.8 to find the moment-generating function of the random variable Y, where $f_Y(y) = \lambda^2ye^{-\lambda y}$, $y > 0$.

3.12.18. Let $Y_1$, $Y_2$, and $Y_3$ be independent random variables, each having the pdf of Question 3.12.17. Use Theorem 3.12.3(b) to find the moment-generating function of $Y_1 + Y_2 + Y_3$. Compare your answer to the moment-generating function in Question 3.12.14.


3.12.19. Use Theorems 3.12.2 and 3.12.3 to determine 
which of the following statements is true: 


(a) The sum of two independent Poisson random vari- 
ables has a Poisson distribution. 

(b) The sum of two independent exponential random 
variables has an exponential distribution. 

(c) The sum of two independent normal random vari- 
ables has a normal distribution. 


3.12.20. Calculate P(X <2) if My(t) =(1 + 3e')’. 


3.12.21. Suppose that $Y_1, Y_2, \ldots, Y_n$ is a random sample of size n from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Use moment-generating functions to deduce the pdf of $\bar{Y} = \dfrac{1}{n}\sum_{i=1}^{n} Y_i$.


3.12.22. Suppose the moment-generating function for a 
random variable W is given by 


e 7a Dy 
Myw(t)= —343e [= arti 
w(t)=e (5+5¢) 


Calculate P(W <1). (Hint: Write W as a sum.) 


3.12.23. Suppose that X is a Poisson random variable, where $p_X(k) = e^{-\lambda}\lambda^k/k!$, $k = 0, 1, \ldots$.


(a) Does the random variable W = 3X have a Poisson 
distribution? 

(b) Does the random variable W =3X + 1 have a Poisson 
distribution? 


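3.12.24. Suppose that Y is a normal variable, where

$$f_Y(y) = \left(1/\sqrt{2\pi}\,\sigma\right) \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right], \quad -\infty < y < \infty$$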


(a) Does the random variable W = 3Y have a normal 
distribution? 

(b) Does the random variable W =3Y + 1 have a normal 
distribution? 


3.13 Taking a Second Look at Statistics 
(Interpreting Means) 


One of the most important ideas coming out of Chapter 3 is the notion of the 
expected value (or mean) of a random variable. Defined in Section 3.5 as a number 
that reflects the “center” of a pdf, the expected value ($\mu$) was originally introduced
for the benefit of gamblers. It spoke directly to one of their most fundamental 
questions—How much will I win or lose, on the average, if I play a certain game? 
(Actually, the real question they probably had in mind was “How much are you 
going to lose, on the average?”) Despite having had such a selfish, materialis- 
tic, gambling-oriented raison d’etre, the expected value was quickly embraced by 
(respectable) scientists and researchers of all persuasions as a preeminently useful 
descriptor of a distribution. Today, it would not be an exaggeration to claim that the 
majority of all statistical analyses focus on either (1) the expected value of a single random variable or (2) comparing the expected values of two or more random variables.

In the lingo of applied statistics, there are actually two fundamentally different types of “means”: population means and sample means. The term “population mean” is a synonym for what mathematical statisticians would call an expected value; that is, a population mean ($\mu$) is a weighted average of the possible values associated with a theoretical probability model, either $p_X(k)$ or $f_Y(y)$, depending on whether the underlying random variable is discrete or continuous. A sample mean is the arithmetic average of a set of measurements. If, for example, n observations, $y_1, y_2, \ldots, y_n$, are taken on a continuous random variable Y, the sample mean is denoted $\bar{y}$, where

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

Conceptually, sample means are estimates of population means, where the “quality” of the estimation is a function of (1) the sample size and (2) the standard deviation ($\sigma$) associated with the individual measurements. Intuitively, as the sample size gets larger and/or the standard deviation gets smaller, the approximation will tend to get better.

Interpreting means (either $\bar{y}$ or $\mu$) is not always easy. To be sure, what they imply in principle is clear enough: both $\bar{y}$ and $\mu$ are measuring the centers of their respective distributions. Still, many a wrong conclusion can be traced directly to researchers misunderstanding the value of a mean. Why? Because the distributions that $\bar{y}$ and/or $\mu$ are actually representing may be dramatically different from the distributions we think they are representing.

An interesting case in point arises in connection with SAT scores. Each fall 
the average SATs earned by students in each of the fifty states and the Dis- 
trict of Columbia are released by the Educational Testing Service (ETS). With 
“accountability” being one of the new paradigms and buzzwords associated with 
K-12 education, SAT scores have become highly politicized. At the national level, 
Democrats and Republicans each campaign on their own versions of education 
reform, fueled in no small measure by scores on standardized exams, SATs included; 
at the state level, legislatures often modify education budgets in response to how 
well or how poorly their students performed the year before. Does it make sense, 
though, to use SAT averages to characterize the quality of a state’s education sys- 
tem? Absolutely not! Averages of this sort refer to very different distributions 
from state to state. Any attempt to interpret them at face value will necessarily be 
misleading. 

One such state-by-state SAT comparison that appeared in the mid-90s is repro- 
duced in Table 3.13.1. Notice that Tennessee’s entry is 1023, which is the tenth 
highest average listed. Does it follow that Tennessee’s educational system is among 
the best in the nation? Probably not. Most independent assessments of K-12 edu- 
cation rank Tennessee’s schools among the weakest in the nation, not among the 
best. If those opinions are accurate, why do Tennessee’s students do so well on 
the SAT? 

The answer to that question lies in the academic profiles of the students who take the SAT in Tennessee. Most college-bound students in that state apply exclusively to schools in the South and the Midwest, where admissions are based on the ACT, not the SAT. The SAT is primarily used by private schools, where admissions tend to be more competitive. As a result, the students in Tennessee who take the SAT are not representative of the entire population of students in that state. A disproportionate number are exceptionally strong academically, those being the students who feel that they have the ability to be competitive at Ivy League-type schools. The number 1023, then, is the average of something (in this case, an elite subset of all Tennessee students), but it does not correspond to the center of the SAT distribution for all Tennessee students.

Table 3.13.1

State   Average SAT Score     State   Average SAT Score
AK       911                  MT       986
AL      1011                  NE      1025
AZ       939                  NV       913
AR       935                  NH       924
CA       895                  NJ       893
CO       969                  NM      1003
CT       898                  NY       888
DE       892                  NC       860
DC       849                  ND      1056
FL       879                  OH       966
GA       844                  OK      1019
HI       881                  OR       927
ID       969                  PA       879
IL      1024                  RI       882
IN       876                  SC       838
IA      1080                  SD      1031
KS      1044                  TN      1023
KY       997                  TX       886
LA      1011                  UT      1067
ME       883                  VT       899
MD       908                  VA       893
MA       901                  WA       922
MI      1009                  WV       921
MN      1057                  WI      1044
MS      1013                  WY       980
MO      1017

The moral here is that analyzing data effectively requires that we look beyond 
the obvious. What we have learned in Chapter 3 about random variables and prob- 
ability distributions and expected values will be helpful only if we take the time to 
learn about the context and the idiosyncrasies of the phenomenon being studied. To 
do otherwise is likely to lead to conclusions that are, at best, superficial and, at worst, 
incorrect. 


Appendix 3.A.1 Minitab Applications 


Numerous software packages are available for performing a variety of probability 
and statistical calculations. Among the first to be developed, and one that continues 
to be very popular, is Minitab. Beginning here, we will include at the ends of certain 
chapters a short discussion of Minitab solutions to some of the problems that were 
discussed in the chapter. What other software packages can do and the ways their 
outputs are formatted are likely to be quite similar. 


Contained in Minitab are subroutines that can do some of the more important pdf and cdf computations described in Sections 3.3 and 3.4. In the case of binomial random variables, for instance, the statements

MTB > pdf k;
SUBC > binomial n p.

and

MTB > cdf k;
SUBC > binomial n p.

will calculate $\binom{n}{k}p^k(1 - p)^{n-k}$ and $\sum_{r=0}^{k}\binom{n}{r}p^r(1 - p)^{n-r}$, respectively. Figure 3.A.1.1 shows the Minitab program for doing the cdf calculation [$= P(X \le 15)$] asked for in part (a) of Example 3.2.2.

The commands pdf k and cdf k can be run on many of the probability models most likely to be encountered in real-world problems. Those on the list that we have already seen are the binomial, Poisson, normal, uniform, and exponential distributions.

Figure 3.A.1.1

MTB > cdf 15;
SUBC > binomial 30 0.60.
Cumulative Distribution Function
Binomial with n = 30 and p = 0.600000
    x    P(X <= x)
15.00       0.1754

For discrete random variables, the cdf can be printed out in its entirety (that is, for every integer) by deleting the argument k and using the command MTB > cdf;. Typical is the output in Figure 3.A.1.2, corresponding to the cdf for a binomial random variable with n = 4 and $p = \frac{1}{6}$.

Figure 3.A.1.2

MTB > cdf;
SUBC > binomial 4 0.167.
Cumulative Distribution Function
Binomial with n = 4 and p = 0.167000
x    P(X <= x)
0       0.4815
1       0.8676
2       0.9837
3       0.9992
4       1.0000

Also available is an inverse cdf command, which in the case of a continuous random variable Y and a specified probability p identifies the value y having the property that $P(Y \le y) = F_Y(y) = p$. For example, if p = 0.60 and Y is an exponential random variable with pdf $f_Y(y) = e^{-y}$, $y > 0$, the value y = 0.9163 has the property that $P(Y \le 0.9163) = F_Y(0.9163) = 0.60$. That is,

$$F_Y(0.9163) = \int_0^{0.9163} e^{-y}\,dy = 0.60$$

With Minitab the number 0.9163 is found by using the command MTB > invcdf 0.60 (see Figure 3.A.1.3).

Figure 3.A.1.3

MTB > invcdf 0.60;
SUBC > exponential 1.
Inverse Cumulative Distribution Function
Exponential with mean = 1.00000
P(X <= x)        x
   0.6000   0.9163
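For readers without access to Minitab, the same three calculations can be sketched in Python using only the standard library. The function names below (`binomial_pdf`, `binomial_cdf`, `exponential_invcdf`) are illustrative names chosen here, not Minitab or library commands, and the exponential is parameterized by its rate, which equals 1 when the mean is 1.

```python
import math


def binomial_pdf(k, n, p):
    """P(X = k) for a binomial random variable with parameters n and p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """P(X <= k), obtained by summing the pdf from 0 through k."""
    return sum(binomial_pdf(r, n, p) for r in range(k + 1))


def exponential_invcdf(prob, rate=1.0):
    """Solve F(y) = 1 - e^{-rate * y} = prob for y."""
    return -math.log(1 - prob) / rate


# Counterparts of Figures 3.A.1.1, 3.A.1.2, and 3.A.1.3, respectively.
print(round(binomial_cdf(15, 30, 0.60), 4))
print([round(binomial_cdf(k, 4, 0.167), 4) for k in range(5)])
print(round(exponential_invcdf(0.60), 4))
```

The printed values can be compared against the Minitab output shown in the three figures.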
The Gamma Distribution


Although he maintained lifelong literary and artistic interests, Quetelet’s 
mathematical talents led him to a doctorate from the University of Ghent and from 
there to a college teaching position in Brussels. In 1833 he was appointed astronomer 
at the Brussels Royal Observatory after having been largely responsible for its 
founding. His work with the Belgian census marked the beginning of his pioneering 
efforts in what today would be called mathematical sociology. Quetelet was well 
known throughout Europe in scientific and literary circles: At the time of his death he 
was a member of more than one hundred learned societies. 

—Lambert Adolphe Jacques Quetelet (1796-1874) 


4.1 Introduction 


To “qualify” as a probability model, a function defined over a sample space S 
needs to satisfy only two criteria: (1) It must be nonnegative for all outcomes in S, 
and (2) it must sum or integrate to 1. That means, for example, that fy(y) = 3+ 
a, 0 < y <1, can be considered a pdf because fy(y) > 0 for all 0< y <1 and 
fy (3+ F)ava1. 

It certainly does not follow, though, that every fy(y) and px(k) that satisfy 
these two criteria would actually be used as probability models. A pdf has practical 
significance only if it does, indeed, model the probabilistic behavior of real-world 
phenomena. In point of fact, only a handful of functions do [and fy(y) = 5 + a 
0<y<1,is not one of them!]. 

Whether a probability function, say, $f_Y(y)$, adequately models a given phenomenon ultimately depends on whether the physical factors that influence the value of Y parallel the mathematical assumptions implicit in $f_Y(y)$. Surprisingly, many measurements (i.e., random variables) that seem to be very different are actually the consequence of the same set of assumptions (and will, therefore, be modeled by the same pdf). That said, it makes sense to single out these “real-world” pdfs and investigate their properties in more detail. This, of course, is not an idea we are seeing for the first time; recall the attention given to the binomial and hypergeometric distributions in Section 3.2.

Chapter 4 continues in the spirit of Section 3.2 by examining five other widely 
used models. Three of the five are discrete; the other two are continuous. One of 
the continuous pdfs is the normal (or Gaussian) distribution, which, by far, is the 
most important of all probability models. As we will see, the normal “curve” figures 
prominently in every chapter from this point on. 

Examples play a major role in Chapter 4. The only way to appreciate fully the 
generality of a probability model is to look at some of its specific applications. Thus, 
included in this chapter are case studies ranging from the discovery of alpha-particle 
radiation to an early ESP experiment to an analysis of volcanic eruptions to counting 
bug parts in peanut butter. 


4.2 The Poisson Distribution 


The binomial distribution problems that appeared in Section 3.2 all had relatively small values for n, so evaluating $p_X(k) = P(X = k) = \binom{n}{k}p^k(1 - p)^{n-k}$ was not particularly difficult. But suppose n were 1000 and k, 500. Evaluating $p_X(500)$ would be a formidable task for many handheld calculators, even today. Two hundred years ago, the prospect of doing cumbersome binomial calculations by hand was a catalyst for mathematicians to develop some easy-to-use approximations. One of the first such approximations was the Poisson limit, which eventually gave rise to the Poisson distribution. Both are described in Section 4.2.

Simeon Denis Poisson (1781-1840) was an eminent French mathematician and physicist, an academic administrator of some note, and, according to an 1826 letter from the mathematician Abel to a friend, a man who knew “how to behave with a great deal of dignity.” One of Poisson’s many interests was the application of probability to the law, and in 1837 he wrote Recherches sur la Probabilite de Jugements. Included in the latter is a limit for $p_X(k) = \binom{n}{k}p^k(1 - p)^{n-k}$ that holds when n approaches $\infty$, p approaches 0, and np remains constant. In practice, Poisson’s limit is used to approximate hard-to-calculate binomial probabilities where the values of n and p reflect the conditions of the limit, that is, when n is large and p is small.


The Poisson Limit 


Deriving an asymptotic expression for the binomial probability model is a straight- 
forward exercise in calculus, given that np is to remain fixed as n increases. 


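Theorem 4.2.1
Suppose X is a binomial random variable, where

$$P(X = k) = p_X(k) = \binom{n}{k}p^k(1 - p)^{n-k}, \quad k = 0, 1, \ldots, n$$

If $n \to \infty$ and $p \to 0$ in such a way that $\lambda = np$ remains constant, then

$$\lim_{\substack{n \to \infty \\ p \to 0 \\ np = \text{const.}}} P(X = k) = \lim_{\substack{n \to \infty \\ p \to 0 \\ np = \text{const.}}} \binom{n}{k}p^k(1 - p)^{n-k} = \frac{e^{-np}(np)^k}{k!}$$

Proof We begin by rewriting the binomial probability in terms of $\lambda$:

$$\lim_{n \to \infty}\binom{n}{k}p^k(1 - p)^{n-k} = \lim_{n \to \infty}\binom{n}{k}\left(\frac{\lambda}{n}\right)^k\left(1 - \frac{\lambda}{n}\right)^{n-k}$$

$$= \lim_{n \to \infty}\frac{n!}{k!(n - k)!}\,\lambda^k\left(\frac{1}{n^k}\right)\left(1 - \frac{\lambda}{n}\right)^{-k}\left(1 - \frac{\lambda}{n}\right)^n$$

$$= \frac{\lambda^k}{k!}\lim_{n \to \infty}\frac{n!}{(n - k)!}\,\frac{1}{(n - \lambda)^k}\left(1 - \frac{\lambda}{n}\right)^n$$

But since $[1 - (\lambda/n)]^n \to e^{-\lambda}$ as $n \to \infty$, we need show only that

$$\frac{n!}{(n - k)!(n - \lambda)^k} \to 1$$

to prove the theorem. However, note that

$$\frac{n!}{(n - k)!(n - \lambda)^k} = \frac{n(n - 1)\cdots(n - k + 1)}{(n - \lambda)(n - \lambda)\cdots(n - \lambda)}$$

a quantity that, indeed, tends to 1 as $n \to \infty$ (since $\lambda$ remains constant).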
Theorem 4.2.1 is an asymptotic result. Left unanswered is the question of the
relevance of the Poisson limit for finite n and p. That is, how large does n have
to be and how small does p have to be before $e^{-np}(np)^k/k!$ becomes a good
approximation to the binomial probability, $p_X(k)$?

Since “good approximation” is undefined, there is no way to answer that question
in any completely specific way. Tables 4.2.1 and 4.2.2, though, offer a partial
solution by comparing the closeness of the approximation for two particular sets
of values for n and p.

In both cases $\lambda = np$ is equal to 1, but in the former, n is set equal to 5;
in the latter, to 100. We see in Table 4.2.1 (n = 5) that for some k the agreement
between the binomial probability and Poisson’s limit is not very good. If n is as
large as 100, though (Table 4.2.2), the agreement is remarkably good for all k.
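The comparison summarized in Tables 4.2.1 and 4.2.2 can be reproduced with a few lines of code. The following is a minimal sketch, not part of the original text; the function names `binomial_pmf`, `poisson_pmf`, and `max_abs_gap` are illustrative choices, and only the Python standard library is assumed.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Exact binomial probability P(X = k) for n trials, success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson limit e^(-lam) * lam^k / k!."""
    return exp(-lam) * lam**k / factorial(k)

def max_abs_gap(n, lam=1.0):
    """Largest |binomial - Poisson| over k = 0, ..., n when p = lam / n."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(n + 1))

if __name__ == "__main__":
    for n in (5, 100):
        print(f"n = {n}: largest gap over all k = {max_abs_gap(n):.6f}")
        for k in range(4):
            print(f"  k = {k}: binomial = {binomial_pmf(k, n, 1 / n):.6f}, "
                  f"Poisson = {poisson_pmf(k, 1.0):.6f}")
```

Holding $\lambda = np$ fixed at 1 while n grows from 5 to 100 mirrors the conditions of Theorem 4.2.1, so the largest gap should shrink as n increases.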


Table 4.2.1 Binomial Probabilities and Poisson Limits; n = 5 and p = 1/5 ($\lambda = 1$)

  k     $\binom{5}{k}(0.2)^k(0.8)^{5-k}$     $e^{-1}(1)^k/k!$
  0     0.328                                0.368
  1     0.410                                0.368
  2     0.205                                0.184
  3     0.051                                0.061
  4     0.006                                0.015
  5     0.000                                0.003
  6+    0                                    0.001
        1.000                                1.000


Table 4.2.2 Binomial Probabilities and Poisson Limits; n = 100 and p = 1/100 ($\lambda = 1$)

  k     $\binom{100}{k}(0.01)^k(0.99)^{100-k}$     $e^{-1}(1)^k/k!$
  0     0.366032                                   0.367879
  1     0.369730                                   0.367879
  2     0.184865                                   0.183940
  3     0.060999                                   0.061313
  4     0.014942                                   0.015328
  5     0.002898                                   0.003066
  6     0.000463                                   0.000511
  7     0.000063                                   0.000073
  8     0.000007                                   0.000009
  9     0.000001                                   0.000001
  10    0.000000                                   0.000000
        1.000000                                   0.999999


Example 4.2.2


According to the IRS, 137.8 million individual tax returns were filed in 2008. Out of 
that total, 1.4 million taxpayers, or 1.0%, had the good fortune of being audited. Not 
everyone had the same chance of getting caught in the IRS’s headlights: millionaires 
had the considerably higher audit rate of 5.6% (and that number might even go up 
a bit more if the feds find out about your bank accounts in the Caymans and your 
vacation home in Rio). Criminal investigations were initiated against 3749 of all 
those audited, and 1735 of that group were eventually convicted of tax fraud and 
sent to jail. 

Suppose your hometown has 65,000 taxpayers, whose income profile and pro- 
clivity for tax evasion are similar to those of citizens of the United States as a whole, 
and suppose the IRS enforcement efforts remain much the same in the foreseeable 
future. What is the probability that at least three of your neighbors will be house 
guests of Uncle Sam next year? 

Let X denote the number of your neighbors who will be incarcerated. Note
that X is a binomial random variable based on a very large n (= 65,000) and a very
small p (= 1735/137,800,000 = 0.0000126), so Poisson’s limit is clearly applicable
(and helpful). Here,

\[
\begin{aligned}
P(\text{At least three neighbors go to jail}) &= P(X \ge 3) \\
  &= 1 - P(X \le 2) \\
  &= 1 - \sum_{k=0}^{2} \binom{65{,}000}{k} (0.0000126)^k (0.9999874)^{65{,}000-k} \\
  &\approx 1 - \sum_{k=0}^{2} e^{-0.819} \, \frac{(0.819)^k}{k!} = 0.050
\end{aligned}
\]

where $\lambda = np = 65{,}000(0.0000126) = 0.819$.
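For readers who want to see how little is lost by the approximation, the sketch below evaluates both the exact binomial sum and its Poisson counterpart. It is an illustration only: the variable names are arbitrary, and p is left unrounded (1735/137,800,000), so $\lambda$ comes out slightly below the 0.819 used above.

```python
from math import comb, exp, factorial

n = 65_000                    # taxpayers in the hypothetical hometown
p = 1735 / 137_800_000        # per-taxpayer chance of a tax-fraud conviction
lam = n * p                   # Poisson parameter, roughly 0.82

# P(X <= 2) under the exact binomial model and under the Poisson limit
exact_cdf = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))
approx_cdf = sum(exp(-lam) * lam**k / factorial(k) for k in range(3))

exact_tail = 1 - exact_cdf    # P(X >= 3), binomial
approx_tail = 1 - approx_cdf  # P(X >= 3), Poisson

print(f"lambda = {lam:.4f}")
print(f"binomial P(X >= 3) = {exact_tail:.5f}")
print(f"Poisson  P(X >= 3) = {approx_tail:.5f}")
```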
Case Study 4.2.1


Leukemia is a rare form of cancer whose cause and mode of transmission 
remain largely unknown. While evidence abounds that excessive exposure to 
radiation can increase a person’s risk of contracting the disease, it is at the same 
time true that most cases occur among persons whose history contains no such 
overexposure. A related issue, one maybe even more basic than the causality 
question, concerns the spread of the disease. It is safe to say that the prevailing 
medical opinion is that most forms of leukemia are not contagious—still, the 
hypothesis persists that some forms of the disease, particularly the childhood 
variety, may be. What continues to fuel this speculation are the discoveries of 
so-called “leukemia clusters,” aggregations in time and space of unusually large 
numbers of cases. 

To date, one of the most frequently cited leukemia clusters in the medical
literature occurred during the late 1950s and early 1960s in Niles, Illinois, a
suburb of Chicago (75). In the 5⅓-year period from 1956 to the first four months
of 1961, physicians in Niles reported a total of eight cases of leukemia among
children less than fifteen years of age. The number at risk (that is, the number of
residents in that age range) was 7076. To assess the likelihood of that many cases
occurring in such a small population, it is necessary to look first at the leukemia
incidence in neighboring towns. For all of Cook County, excluding Niles, there
were 1,152,695 children less than fifteen years of age, and among those, 286
diagnosed cases of leukemia. That gives an average 5⅓-year leukemia rate of
24.8 cases per 100,000:

\[ \frac{286 \text{ cases for } 5\tfrac{1}{3} \text{ years}}{1{,}152{,}695 \text{ children}} \times \frac{100{,}000}{100{,}000}
   = 24.8 \text{ cases}/100{,}000 \text{ children in } 5\tfrac{1}{3} \text{ years} \]

Now, imagine the 7076 children in Niles to be a series of n = 7076 (independent)
Bernoulli trials, each having a probability of p = 24.8/100,000 = 0.000248
of contracting leukemia. The question then becomes, given an n of 7076 and
a p of 0.000248, how likely is it that eight “successes” would occur? (The
expected number, of course, would be 7076 × 0.000248 = 1.75.) Actually, for
reasons that will be elaborated on in Chapter 6, it will prove more meaningful
to consider the related event, eight or more cases occurring in a 5⅓-year span.
If the probability associated with the latter is very small, it could be argued that
leukemia did not occur randomly in Niles and that, perhaps, contagion was a
factor.

Using the binomial distribution, we can express the probability of eight or
more cases as

\[ P(8 \text{ or more cases}) = \sum_{k=8}^{7076} \binom{7076}{k} (0.000248)^k (0.999752)^{7076-k} \tag{4.2.1} \]

Much of the computational unpleasantness implicit in Equation 4.2.1 can be
avoided by appealing to Theorem 4.2.1. Given that np = 7076 × 0.000248 = 1.75,

\[
\begin{aligned}
P(X \ge 8) &= 1 - P(X \le 7) \\
  &\approx 1 - \sum_{k=0}^{7} \frac{e^{-1.75}(1.75)^k}{k!} \\
  &= 1 - 0.99953 \\
  &= 0.00047
\end{aligned}
\]

How close can we expect 0.00047 to be to the “true” binomial sum? Very 
close. Considering the accuracy of the Poisson limit when n is as small as one 
hundred (recall Table 4.2.2), we should feel very confident here, where n is 7076. 
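The claim that the two answers are very close can be checked directly, since Equation 4.2.1 is no longer difficult to evaluate by machine. The sketch below is offered under stated assumptions: it uses the rounded values p = 0.000248 and $\lambda$ = 1.75 from the case study, and it computes the upper tail as one minus the sum over k = 0, ..., 7; the variable names are illustrative.

```python
from math import comb, exp, factorial

n, p = 7076, 0.000248   # children at risk and assumed per-child probability
lam = 1.75              # the text rounds n * p = 1.7548... to 1.75

# P(X >= 8) = 1 - P(X <= 7) under each model
binom_tail = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(8))
pois_tail = 1 - sum(exp(-lam) * lam**k / factorial(k) for k in range(8))

print(f"binomial P(X >= 8) = {binom_tail:.6f}")
print(f"Poisson  P(X >= 8) = {pois_tail:.6f}")
```

Part of any small discrepancy between the two figures comes from rounding np to 1.75 rather than from the Poisson limit itself.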

Interpreting the 0.00047 probability is not nearly as easy as assessing its 
accuracy. The fact that the probability is so very small tends to denigrate the 
hypothesis that leukemia in Niles occurred at random. On the other hand, rare 
events, such as clusters, do happen by chance. The basic difficulty of putting 
the probability associated with a given cluster into any meaningful perspective 
is not knowing in how many similar communities leukemia did not exhibit a 
tendency to cluster. That there is no obvious way to do this is one reason the 
leukemia controversy is still with us. 


About the Data Publication of the Niles cluster led to a number of research efforts 
on the part of biostatisticians to find quantitative methods capable of detecting 
clustering in space and time for diseases having low epidemicity. Several tech- 
niques were ultimately put forth, but the inherent “noise” in the data—variations in 
population densities, ethnicities, risk factors, and medical practices—often proved 


impossible to overcome. 


Questions 


4.2.1. If a typist averages one misspelling in every 3250 
words, what are the chances that a 6000-word report is 
free of all such errors? Answer the question two ways— 
first, by using an exact binomial analysis, and second, by 
using a Poisson approximation. Does the similarity (or 
dissimilarity) of the two answers surprise you? Explain. 


4.2.2. A medical study recently documented that 905 mis- 
takes were made among the 289,411 prescriptions written 
during one year at a large metropolitan teaching hospi- 
tal. Suppose a patient is admitted with a condition serious 
enough to warrant 10 different prescriptions. Approxi- 
mate the probability that at least one will contain an 
error. 


4.2.3. Five hundred people are attending the first annual
“I was Hit by Lightning” Club. Approximate the probability
that at most one of the five hundred was born on
Poisson’s birthday.


4.2.4. A chromosome mutation linked with colorblind- 
ness is known to occur, on the average, once in every ten 
thousand births. 


(a) Approximate the probability that exactly three of 
the next twenty thousand babies born will have the 
mutation. 

(b) How many babies out of the next twenty thousand
would have to be born with the mutation
to convince you that the “one in ten thousand”
estimate is too low? [Hint: Calculate
$P(X \ge k) = 1 - P(X \le k - 1)$ for various k. (Recall Case
Study 4.2.1.)]


4.2.5. Suppose that 1% of all items in a supermarket are 
not priced properly. A customer buys ten items. What is 
the probability that she will be delayed by the cashier 
because one or more of her items require a price check? 


Calculate both a binomial answer and a Poisson answer. Is 
the binomial model “exact” in this case? Explain. 


4.2.6. A newly formed life insurance company has under- 
written term policies on 120 women between the ages of 
forty and forty-four. Suppose that each woman has a 1/150 
probability of dying during the next calendar year, and 
that each death requires the company to pay out $50,000 
in benefits. Approximate the probability that the company 
will have to pay at least $150,000 in benefits next year. 


4.2.7. According to an airline industry report (178),
roughly 1 piece of luggage out of every 200 that are
checked is lost. Suppose that a frequent-flying businesswoman
will be checking 120 bags over the course of the
next year. Approximate the probability that she will lose
2 or more pieces of luggage.


4.2.8. Electromagnetic fields generated by power transmission
lines are suspected by some researchers to be a
cause of cancer. Especially at risk would be telephone linemen
because of their frequent proximity to high-voltage
wires. According to one study, two cases of a rare form
of cancer were detected among a group of 9500 linemen
(174). In the general population, the incidence of that particular
condition is on the order of one in a million. What
would you conclude? (Hint: Recall the approach taken in
Case Study 4.2.1.)


4.2.9. Astronomers estimate that as many as one hundred 
billion stars in the Milky Way galaxy are encircled by plan- 
ets. If so, we may have a plethora of cosmic neighbors. Let 
p denote the probability that any such solar system con- 
tains intelligent life. How small can p be and still give a 
fifty-fifty chance that we are not alone? 


The Poisson Distribution 


The real significance of Poisson’s limit theorem went unrecognized for more than
fifty years. For most of the latter part of the nineteenth century, Theorem 4.2.1
was taken strictly at face value: It provides a convenient approximation for $p_X(k)$
when X is binomial, n is large, and p is small. But then in 1898 a German professor,
Ladislaus von Bortkiewicz, published a monograph entitled Das Gesetz der Kleinen
Zahlen (The Law of Small Numbers) that would quickly transform Poisson’s “limit”
into Poisson’s “distribution.”

What is best remembered about Bortkiewicz’s monograph is the curious set of
data described in Question 4.2.10. The measurements recorded were the numbers of
Prussian cavalry soldiers who had been kicked to death by their horses. In analyzing
those figures, Bortkiewicz was able to show that the function $e^{-\lambda}\lambda^k/k!$ is a useful
probability model in its own right, even when (1) no explicit binomial random
variable is present and (2) values for n and p are unavailable. Other researchers were
quick to follow Bortkiewicz’s lead, and a steady stream of Poisson distribution
applications began showing up in technical journals. Today the function
$p_X(k) = e^{-\lambda}\lambda^k/k!$ is universally recognized as being among the three or four
most important data models in all of statistics.


Theorem 4.2.2. The random variable X is said to have a Poisson distribution if

\[ p_X(k) = P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots \]

where $\lambda$ is a positive constant. Also, for any Poisson random variable,
$E(X) = \lambda$ and $\text{Var}(X) = \lambda$.


Proof To show that $p_X(k)$ qualifies as a probability function, note, first of all, that
$p_X(k) \ge 0$ for all nonnegative integers k. Also, $p_X(k)$ sums to 1:

\[ \sum_{k=0}^{\infty} p_X(k) = \sum_{k=0}^{\infty} \frac{e^{-\lambda}\lambda^k}{k!}
   = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} \cdot e^{\lambda} = 1 \]

since $\sum_{k=0}^{\infty} \lambda^k/k!$ is the Taylor series expansion of $e^{\lambda}$. Verifying that
$E(X) = \lambda$ and $\text{Var}(X) = \lambda$ has already been done in Example 3.12.9, using
moment-generating functions.
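The three properties asserted in Theorem 4.2.2 can also be checked numerically for any particular $\lambda$. The sketch below is only a sanity check, not a proof: the value $\lambda$ = 2.5 and the truncation point k = 100 are arbitrary choices, and the helper `poisson_pmf_table` is a hypothetical name. It builds the probabilities from the recurrence $p_X(k) = p_X(k-1) \cdot \lambda/k$, which follows directly from the formula for $p_X(k)$.

```python
from math import exp

def poisson_pmf_table(lam, kmax):
    """Return [p(0), ..., p(kmax)] using p(k) = p(k - 1) * lam / k."""
    probs = [exp(-lam)]
    for k in range(1, kmax + 1):
        probs.append(probs[-1] * lam / k)
    return probs

lam = 2.5                                   # arbitrary illustrative value
probs = poisson_pmf_table(lam, 100)         # terms beyond k = 100 are negligible here

total = sum(probs)
mean = sum(k * pk for k, pk in enumerate(probs))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))

print(f"sum = {total:.12f}, mean = {mean:.12f}, variance = {variance:.12f}")
```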
Fitting the Poisson Distribution to Data


Poisson data invariably refer to the numbers of times a certain event occurs during
each of a series of “units” (often time or space). For example, X might be the weekly
number of traffic accidents reported at a given intersection. If such records are kept
for an entire year, the resulting data would be the sample $k_1, k_2, \ldots, k_{52}$, where each
$k_i$ is a nonnegative integer.

Whether or not a set of $k_i$’s can be viewed as Poisson data depends on whether
the proportions of 0’s, 1’s, 2’s, and so on, in the sample are numerically similar to the
probabilities that X = 0, 1, 2, and so on, as predicted by $p_X(k) = e^{-\lambda}\lambda^k/k!$. The next
two case studies show data sets where the variability in the observed $k_i$’s is consistent
with the probabilities predicted by the Poisson distribution. Notice in each case that
the $\lambda$ in $p_X(k)$ is replaced by the sample mean of the $k_i$’s, that is, by
$\bar{k} = (1/n)\sum_{i=1}^{n} k_i$. Why these phenomena are described by the Poisson
distribution will be discussed later in this section; why $\lambda$ is replaced by $\bar{k}$
will be explained in Chapter 5.


Case Study 4.2.2 


Among the early research projects investigating the nature of radiation was a
1910 study of α-particle emission by Ernest Rutherford and Hans Geiger (152).
For each of 2608 eighth-minute intervals, the two physicists recorded the number
of α particles emitted from a polonium source (as detected by what would
eventually be called a Geiger counter). The numbers and proportions of times
that k such particles were detected in a given eighth-minute (k = 0, 1, 2, ...) are
detailed in the first three columns of Table 4.2.3. Two α particles, for example,
were detected in each of 383 eighth-minute intervals, meaning that X = 2 was
the observation recorded 15% (= 383/2608 × 100) of the time.


Table 4.2.3

No. Detected, k     Frequency     Proportion     $p_X(k) = e^{-3.87}(3.87)^k/k!$
      0                 57           0.02            0.02
      1                203           0.08            0.08
      2                383           0.15            0.16
      3                525           0.20            0.20
      4                532           0.20            0.20
      5                408           0.16            0.15
      6                273           0.10            0.10
      7                139           0.05            0.05
      8                 45           0.02            0.03
      9                 27           0.01            0.01
     10                 10           0.00            0.00
     11+                 6           0.00            0.00
                      2608           1.0             1.0

To see whether a probability function of the form $p_X(k) = e^{-\lambda}\lambda^k/k!$ can
adequately model the observed proportions in the third column, we first need
to replace $\lambda$ with the sample’s average value for X. Suppose the six observations
comprising the “11+” category are each assigned the value 11. Then

\[ \bar{k} = \frac{57(0) + 203(1) + 383(2) + \cdots + 6(11)}{2608} = \frac{10{,}092}{2608} = 3.87 \]

and the presumed model is $p_X(k) = e^{-3.87}(3.87)^k/k!$, k = 0, 1, 2, .... Notice how
closely the entries in the fourth column [i.e., $p_X(0), p_X(1), p_X(2), \ldots$] agree with
the sample proportions appearing in the third column. The conclusion here is
inescapable: The phenomenon of radiation can be modeled very effectively by
the Poisson distribution.
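The fitting procedure just described (estimate $\lambda$ by the sample mean, then evaluate the Poisson formula at each k) is short enough to sketch in code. What follows is one possible implementation, using the frequencies from Table 4.2.3 and treating the “11+” class as exactly 11, as the text does; the variable names are illustrative.

```python
from math import exp, factorial

# Frequencies from Table 4.2.3; list index k = number of particles detected.
# The final entry is the "11+" class, assigned the value 11 as in the text.
freq = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 6]

n_intervals = sum(freq)
k_bar = sum(k * f for k, f in enumerate(freq)) / n_intervals
lam = round(k_bar, 2)          # the text works with the rounded value 3.87

observed = [f / n_intervals for f in freq]
fitted = [exp(-lam) * lam**k / factorial(k) for k in range(len(freq))]

print(f"sample mean = {k_bar:.4f}")
for k, (obs, fit) in enumerate(zip(observed, fitted)):
    print(f"k = {k:2d}: observed = {obs:.2f}, fitted = {fit:.2f}")
```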


About the Data The most obvious (and frequent) application of the Pois- 
son/radioactivity relationship is to use the former to describe and predict the 
behavior of the latter. But the relationship is also routinely used in reverse. Work- 
ers responsible for inspecting areas where radioactive contamination is a potential 
hazard need to know that their monitoring equipment is functioning properly. How 
do they do that? A standard safety procedure before entering what might be a life- 
threatening “hot zone” is to take a series of readings on a known radioactive source 
(much like the Rutherford/Geiger experiment itself). If the resulting set of counts 
does not follow a Poisson distribution, the meter is assumed to be broken and must 
be repaired or replaced. 


Case Study 4.2.3 


In the 432 years from 1500 to 1931, war broke out somewhere in the world 
a total of 299 times. By definition, a military action was a war if it either was 
legally declared, involved over fifty thousand troops, or resulted in significant 
boundary realignments. To achieve greater uniformity from war to war, major 
confrontations were split into smaller “subwars”: World War I, for example, was 
treated as five separate conflicts (143). 

Let X denote the number of wars starting in a given year. The first two
columns in Table 4.2.4 show the distribution of X for the 432-year period in
question. Here the average number of wars beginning in a given year was 0.69:

\[ \frac{0(223) + 1(142) + 2(48) + 3(15) + 4(4)}{432} = 0.69 \]

The last two columns in Table 4.2.4 compare the observed proportions of years
for which X = k with the proposed Poisson model

\[ p_X(k) = e^{-0.69} \, \frac{(0.69)^k}{k!}, \quad k = 0, 1, 2, \ldots \]



Figure 4.2.1 


Table 4.2.4

Number of Wars, k     Frequency     Proportion     $p_X(k) = e^{-0.69}(0.69)^k/k!$
       0                 223           0.52            0.50
       1                 142           0.33            0.35
       2                  48           0.11            0.12
       3                  15           0.03            0.03
       4+                  4           0.01            0.00
                         432           1.00            1.00


Clearly, there is a very close agreement between the two—the number of wars 
beginning in a given year can be considered a Poisson random variable. 


The Poisson Model: The Law of Small Numbers 


The fact that the expression $e^{-\lambda}\lambda^k/k!$ models phenomena as diverse as
α-radiation and outbreak of war raises an obvious question: Why is that same $p_X(k)$
describing such different random variables? The answer is that the underlying
physical conditions that produce those two sets of measurements are actually much
the same, despite how superficially different the resulting data may seem to be.
Both phenomena are examples of a set of mathematical assumptions known as the
Poisson model. Any measurements that are derived from conditions that mirror those
assumptions will necessarily vary in accordance with the Poisson distribution.

Suppose a series of events is occurring during a time interval of length T. Imagine
dividing T into n nonoverlapping subintervals, each of length T/n, where n is large
(see Figure 4.2.1). Furthermore, suppose that

1. The probability that two or more events occur in any given subinterval is
   essentially 0.
2. The events occur independently.
3. The probability that an event occurs during a given subinterval is constant over
   the entire interval from 0 to T.

[Figure 4.2.1: events occurring along a time axis of length T that is divided into
n subintervals, numbered 1, 2, 3, ..., n, each of length T/n]

The n subintervals, then, are analogous to the n independent trials that form the
backdrop for the “binomial model”: In each subinterval there will be either zero
events or one event, where

\[ p_n = P(\text{Event occurs in a given subinterval}) \]

remains constant from subinterval to subinterval.


Example 
4.2.3 
Let the random variable X denote the total number of events occurring during
time T, and let $\lambda$ denote the rate at which events occur (e.g., $\lambda$ might be expressed
as 2.5 events per minute). Then

\[ E(X) = \lambda T = n p_n \quad \text{(why?)} \]

which implies that $p_n = \lambda T/n$. From Theorem 4.2.1, then,

\[
\begin{aligned}
p_X(k) = P(X = k) &= \binom{n}{k} \left(\frac{\lambda T}{n}\right)^k \left(1 - \frac{\lambda T}{n}\right)^{n-k} \\
  &\approx \frac{e^{-n(\lambda T/n)} \, [n(\lambda T/n)]^k}{k!} \\
  &= \frac{e^{-\lambda T}(\lambda T)^k}{k!}
\end{aligned}
\tag{4.2.2}
\]

Now we can see more clearly why Poisson’s “limit,” as given in Theorem 4.2.1,
is so important. The three Poisson model assumptions are so unexceptional that
they apply to countless real-world phenomena. Each time they do, the pdf
$p_X(k) = e^{-\lambda T}(\lambda T)^k/k!$ finds another application.
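A small simulation can make the argument behind Equation 4.2.2 concrete. The sketch below is illustrative only and rests on several assumptions that are not in the text: a rate of 2.5 events per minute observed for T = 2 minutes, n = 500 subintervals, 2000 repetitions, and a fixed random seed. Each subinterval independently contains one event with probability $p_n = \lambda T/n$ and never more than one, which is exactly the setup of the three Poisson model assumptions.

```python
import random
from math import exp, factorial

def simulate_count(rate, T, n, rng):
    """Count events in [0, T] when each of n subintervals independently
    holds one event with probability p_n = rate * T / n (never two)."""
    p_n = rate * T / n
    return sum(1 for _ in range(n) if rng.random() < p_n)

rng = random.Random(0)                       # fixed seed for reproducibility
rate, T, n, reps = 2.5, 2.0, 500, 2000       # all four values are assumptions
counts = [simulate_count(rate, T, n, rng) for _ in range(reps)]
sample_mean = sum(counts) / reps

print(f"lambda * T = {rate * T}, simulated mean = {sample_mean:.3f}")
for k in range(11):
    empirical = counts.count(k) / reps
    model = exp(-rate * T) * (rate * T) ** k / factorial(k)
    print(f"k = {k:2d}: simulated = {empirical:.3f}, Poisson = {model:.3f}")
```

The simulated proportions should track $e^{-\lambda T}(\lambda T)^k/k!$ closely, with the remaining differences attributable to sampling variation and to n being finite.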


It is not surprising that the number of α particles emitted by a radioactive source in a
given unit of time follows a Poisson distribution. Nuclear physicists have known for
a long time that the phenomenon of radioactivity obeys the same assumptions that
define the Poisson model. Each is a poster child for the other. Case Study 4.2.3, on
the other hand, is a different matter altogether. It is not so obvious why the number
of wars starting in a given year should have a Poisson distribution. Reconciling the
data in Table 4.2.4 with the “picture” of the Poisson model in Figure 4.2.1 raises a
number of questions that never came up in connection with radioactivity.

Imagine recording the data summarized in Table 4.2.4. For each year, new wars
would appear as “occurrences” on a grid of cells, similar to the one pictured in
Figure 4.2.2 for 1776. Civil wars would be entered along the diagonal and wars
between two countries, above the diagonal. Each cell would contain either a 0 (no
war) or a 1 (war). The year 1776 saw the onset of only one major conflict, the
Revolutionary War between the United States and Britain. If the random variable

\[ X_i = \text{number of outbreaks of war in year } i, \quad i = 1500, 1501, \ldots, 1931 \]

then $X_{1776} = 1$.

What do we know, in general, about the random variable $X_i$? If each cell in the
grid is thought of as a “trial,” then $X_i$ is clearly the number of “successes” in those
n trials. Does that make $X_i$ a binomial random variable? Not necessarily. According
to Theorem 3.2.1, $X_i$ qualifies as a binomial random variable only if the trials are
independent and the probability of success is the same from trial to trial.

At first glance, the independence assumption would seem to be problematic.
There is no denying that some wars are linked to others. The timing of the French
Revolution, for example, is widely thought to have been influenced by the success
of the American Revolution. Does that make the two wars dependent? In a
historical sense, yes; in a statistical sense, no. The French Revolution began in 1789,
thirteen years after the onset of the American Revolution. The random variable
$X_{1776}$, though, focuses only on wars starting in 1776, so linkages that are years apart
do not compromise the binomial’s independence assumption.

Figure 4.2.2

Not all wars identified in Case Study 4.2.3, though, can claim to be independent
in the statistical sense. The last entry in Column 2 of Table 4.2.4 shows that four
or more wars erupted on four separate occasions; the Poisson model (Column 4)
predicted that no years would experience that many new wars. Most likely, those
four years had a decided excess of new wars because of political alliances that led to
a cascade of new wars being declared simultaneously. Those wars definitely violated
the binomial assumption of independent trials, but they accounted for only a very
small fraction of the entire data set.

The other binomial assumption, that each trial has the same probability of
success, holds fairly well. For the vast majority of years and the vast majority of
countries, the probabilities of new wars will be very small and most likely similar.
For almost every year, then, $X_i$ can be considered a binomial random variable based
on a very large n and a very small p. That being the case, it follows by Theorem 4.2.1
that each $X_i$, i = 1500, 1501, ..., 1931 can be approximated by a Poisson distribution.

One other assumption needs to be addressed. Knowing that $X_{1500}, X_{1501}, \ldots,
X_{1931}$, individually, are Poisson random variables does not guarantee that the
distribution of all 432 $X_i$’s will have a Poisson distribution. Only if the $X_i$’s are
independent observations having basically the same Poisson distribution (that is, the
same value for $\lambda$) will their overall distribution be Poisson. But Table 4.2.4 does
have a Poisson distribution, implying that the set of $X_i$’s does, in fact, behave like a
random sample. Along with that sweeping conclusion, though, comes the realization
that, as a species, our levels of belligerence at the national level [that is, the 432
values for $\lambda = E(X_i)$] have remained basically the same for the past five hundred years.
Whether that should be viewed as a reason for celebration or a cause for alarm is a
question best left to historians, not statisticians.


Calculating Poisson Probabilities 
Three formulas have appeared in connection with the Poisson distribution:

1. $p_X(k) = e^{-np} \, \dfrac{(np)^k}{k!}$

2. $p_X(k) = e^{-\lambda} \, \dfrac{\lambda^k}{k!}$

3. $p_X(k) = e^{-\lambda T} \, \dfrac{(\lambda T)^k}{k!}$

The first is the approximating Poisson limit, where the $p_X(k)$ on the left-hand side
refers to the probability that a binomial random variable (with parameters n and p)
is equal to k. Formulas (2) and (3) are sometimes confused because both presume to
give the probability that a Poisson random variable equals k. Why are they different?

Actually, all three formulas are the same in the sense that the right-hand sides
of each could be written as

4. $e^{-E(X)} \, \dfrac{[E(X)]^k}{k!}$

In formula (1), X is binomial, so E(X) = np. In formula (2), which comes from
Theorem 4.2.2, $\lambda$ is defined to be E(X). Formula (3) covers all those situations where the
units of X and $\lambda$ are not consistent, in which case $E(X) \ne \lambda$. However, $\lambda$ can always
be multiplied by an appropriate constant T to make $\lambda T$ equal to E(X).

For example, suppose a certain radioisotope is known to emit α particles at the
rate of $\lambda$ = 1.5 emissions/second. For whatever reason, though, the experimenter
defines the Poisson random variable X to be the number of emissions counted in a
given minute. Then T = 60 seconds and

\[ E(X) = 1.5 \text{ emissions/second} \times 60 \text{ seconds} = \lambda T = 90 \text{ emissions} \]


Example 4.2.4



Entomologists estimate that an average person consumes almost a pound of bug 
parts each year (173). There are that many insect eggs, larvae, and miscellaneous 
body pieces in the foods we eat and the liquids we drink. The Food and Drug 
Administration (FDA) sets a Food Defect Action Level (FDAL) for each product: 
Bug-part concentrations below the FDAL are considered acceptable. The legal limit 
for peanut butter, for example, is thirty insect fragments per hundred grams. Sup- 
pose the crackers you just bought from a vending machine are spread with twenty 
grams of peanut butter. What are the chances that your snack will include at least 
five crunchy critters? 

Let X denote the number of bug parts in twenty grams of peanut butter. Assuming
the worst, suppose the contamination level equals the FDA limit, that is, thirty
fragments per hundred grams (or 0.30 fragment/g). Notice that T in this case is
twenty grams, making E(X) = 6.0:

\[ \frac{0.30 \text{ fragment}}{\text{g}} \times 20 \text{ g} = 6.0 \text{ fragments} \]

It follows, then, that the probability that your snack contains five or more bug parts
is a disgusting 0.71:

\[
\begin{aligned}
P(X \ge 5) = 1 - P(X \le 4) &= 1 - \sum_{k=0}^{4} \frac{e^{-6.0}(6.0)^k}{k!} \\
  &= 1 - 0.29 \\
  &= 0.71
\end{aligned}
\]

Bon appétit!
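The same calculation pattern, converting a rate to E(X) = $\lambda T$ and then summing the lower tail, recurs throughout the questions that follow, so a small helper may be convenient. The function `poisson_tail` below is a hypothetical name introduced for illustration; the rate and T are the values from the peanut butter example.

```python
from math import exp, factorial

def poisson_tail(lam, k_min):
    """P(X >= k_min) for a Poisson random variable with mean lam."""
    return 1 - sum(exp(-lam) * lam**k / factorial(k) for k in range(k_min))

rate = 0.30            # fragments per gram (the FDAL of 30 per 100 g)
T = 20                 # grams of peanut butter on the crackers
p_five_or_more = poisson_tail(rate * T, 5)

print(f"E(X) = {rate * T:.1f}, P(X >= 5) = {p_five_or_more:.3f}")
```

Carried to three decimals, the probability is slightly larger than the 0.71 quoted above, which comes from rounding the sum to 0.29 before subtracting.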


Questions


4.2.10. During the latter part of the nineteenth century,
Prussian officials gathered information relating to the hazards
that horses posed to cavalry soldiers. A total of ten
cavalry corps were monitored over a period of twenty
years. Recorded for each year and each corps was X, the
annual number of fatalities due to kicks. Summarized in
the following table are the two hundred values recorded
for X (12). Show that these data can be modeled by
a Poisson pdf. Follow the procedure illustrated in Case
Studies 4.2.2 and 4.2.3.


                      Observed Number of Corps-Years
No. of Deaths, k      in Which k Fatalities Occurred
       0                         109
       1                          65
       2                          22
       3                           3
       4                           1
                                 200


4.2.11. A random sample of 356 seniors enrolled at the 
University of West Florida was categorized according to 
X, the number of times they had changed majors (110). 
Based on the summary of that information shown in the 
following table, would you conclude that X can be treated 
as a Poisson random variable? 


Number of Major Changes Frequency 
0 237 
1 90 
2 22 
3 7 


4.2.12. Midwestern Skies books ten commuter flights 
each week. Passenger totals are much the same from week 
to week, as are the numbers of pieces of luggage that are 
checked. Listed in the following table are the numbers of 
bags that were lost during each of the first forty weeks in 
2009. Do these figures support the presumption that the 
number of bags lost by Midwestern during a typical week 
is a Poisson random variable? 


Week   Bags Lost      Week   Bags Lost      Week   Bags Lost
  1        1           14        2           27        1
  2        0           15        1           28        2
  3        0           16        3           29        0
  4        3           17        0           30        0
  5        4           18        2           31        1
  6        1           19        5           32        3
  7        0           20        2           33        1
  8        2           21        1           34        2
  9        0           22        1           35        0
 10        2           23        1           36        1
 11        3           24        2           37        4
 12        1           25        1           38        2
 13        2           26        3           39        1
                                             40        0


4.2.13. In 1893, New Zealand became the first country
to permit women to vote. Scattered over the ensuing 113
years, various countries joined the movement to grant this
right to women. The table below (121) shows how many
countries took this step in a given year. Do these data
seem to follow a Poisson distribution?


Yearly Number of Countries
Granting Women the Vote        Frequency
           0                      82
           1                      25
           2                       4
           3                       0
           4                       2


4.2.14. The following are the daily numbers of death
notices for women over the age of eighty that appeared
in the London Times over a three-year period (74).


Number of Deaths     Observed Frequency
       0                   162
       1                   267
       2                   271
       3                   185
       4                   111
       5                    61
       6                    27
       7                     8
       8                     3
       9                     1
                          1096


(a) Does the Poisson pdf provide a good description of
the variability pattern evident in these data?

(b) If your answer to part (a) is “no,” which of the Poisson
model assumptions do you think might not be
holding?


4.2.15. A certain species of European mite is capable of 
damaging the bark on orange trees. The following are the 
results of inspections done on one hundred saplings cho- 
sen at random from a large orchard. The measurement 
recorded, X, is the number of mite infestations found on 
the trunk of each tree. Is it reasonable to assume that X 
is a Poisson random variable? If not, which of the Poisson 
model assumptions is likely not to be true? 


No. of Infestations,k No. of Trees 


0 55 
1 20 
2 21 
3 1 
4 1 
5 1 
6 0 
7 1 


4.2.16. A tool and die press that stamps out cams used 
in small gasoline engines tends to break down once every 
five hours. The machine can be repaired and put back on 
line quickly, but each such incident costs $50. What is the 
probability that maintenance expenses for the press will 
be no more than $100 on a typical eight-hour workday? 


4.2.17. In a new fiber-optic communication system, trans- 
mission errors occur at the rate of 1.5 per ten seconds. 
What is the probability that more than two errors will 
occur during the next half-minute? 


4.2.18. Assume that the number of hits, X, that a baseball
team makes in a nine-inning game has a Poisson distribution.
If the probability that a team makes zero hits is 1/3,
what are their chances of getting two or more hits?


4.2.19. Flaws in metal sheeting produced by a high- 
temperature roller occur at the rate of one per ten square 
feet. What is the probability that three or more flaws will 
appear in a five-by-eight-foot panel? 


4.2.20. Suppose a radioactive source is metered for two
hours, during which time the total number of alpha particles
counted is 482. What is the probability that exactly
three particles will be counted in the next two minutes?
Answer the question two ways: first, by defining X to
be the number of particles counted in two minutes, and
second, by defining X to be the number of particles
counted in one minute.


4.2.21. Suppose that on-the-job injuries in a textile mill 
occur at the rate of 0.1 per day. 


(a) What is the probability that two accidents will occur 
during the next (five-day) workweek? 

(b) Is the probability that four accidents will occur over 
the next two workweeks the square of your answer 
to part (a)? Explain. 


4.2.22. Find P(X = 4) if the random variable X has a 
Poisson distribution such that P(X =1)= P(X =2). 


4.2.23. Let X be a Poisson random variable with param- 
eter A. Show that the probability that X is even is (1 + 
er), 


4.2.24. Let X and Y be independent Poisson random
variables with parameters $\lambda$ and $\mu$, respectively. Example
3.12.10 established that X + Y is also Poisson with parameter
$\lambda + \mu$. Prove that same result using Theorem 3.8.3.


4.2.25. If $X_1$ is a Poisson random variable for which
$E(X_1) = \lambda$ and if the conditional pdf of $X_2$ given that
$X_1 = x_1$ is binomial with parameters $x_1$ and p, show that
the marginal pdf of $X_2$ is Poisson with $E(X_2) = \lambda p$.


Intervals Between Events: The Poisson/Exponential Relationship 


Situations sometimes arise where the time interval between consecutively occurring 
events is an important random variable. Imagine being responsible for the main- 
tenance on a network of computers. Clearly, the number of technicians you would 
need to employ in order to be capable of responding to service calls in a timely 
fashion would be a function of the “waiting time” from one breakdown to another. 

Figure 4.2.3 shows the relationship between the random variables X and Y, 
where X denotes the number of occurrences in a unit of time and Y denotes the 
interval between consecutive occurrences. Pictured are six intervals: X = 0 on one 
occasion, X = 1 on three occasions, X = 2 once, and X = 3 once. Resulting from those
eight occurrences are seven measurements on the random variable Y. Obviously, the 
pdf for Y will depend on the pdf for X. One particularly important special case of 
that dependence is the Poisson/exponential relationship outlined in Theorem 4.2.3. 


Figure 4.2.3 [Timeline divided into unit-time intervals, with the Y values marked as the gaps between consecutive occurrences.]


Theorem 4.2.3

Suppose a series of events satisfying the Poisson model are occurring at the rate of
$\lambda$ per unit time. Let the random variable Y denote the interval between consecutive
events. Then Y has the exponential distribution

$$f_Y(y) = \lambda e^{-\lambda y}, \quad y > 0$$

Proof Suppose an event has occurred at time a. Consider the interval that extends
from a to a + y. Since the (Poisson) events are occurring at the rate of $\lambda$ per unit time,
the probability that no outcomes will occur in the interval (a, a + y) is
$\frac{e^{-\lambda y}(\lambda y)^0}{0!} = e^{-\lambda y}$.

Define the random variable Y to denote the interval between consecutive occurrences.
Notice that there will be no occurrences in the interval (a, a + y) only if
Y > y. Therefore,

$$P(Y > y) = e^{-\lambda y}$$

or, equivalently,

$$F_Y(y) = P(Y \le y) = 1 - P(Y > y) = 1 - e^{-\lambda y}$$

Let $f_Y(y)$ be the (unknown) pdf for Y. It must be true that

$$P(Y \le y) = \int_0^y f_Y(t)\,dt$$

Taking derivatives of the two expressions for $F_Y(y)$ gives

$$\frac{d}{dy}\int_0^y f_Y(t)\,dt = \frac{d}{dy}\left(1 - e^{-\lambda y}\right)$$

which implies that

$$f_Y(y) = \lambda e^{-\lambda y}, \quad y > 0$$
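
The conclusion of Theorem 4.2.3 can be checked informally by simulation. The sketch below is only an illustration, not part of the proof: it approximates a Poisson process by independent Bernoulli trials on very short time slots (an assumption made for convenience), and the rate, slot width, seed, and function name `simulate_gaps` are all hypothetical choices.

```python
import math
import random

def simulate_gaps(rate, dt=0.001, n_slots=2_000_000, seed=1):
    """Approximate a Poisson process by independent Bernoulli trials.

    Each slot of width dt holds an event with probability rate*dt
    (a discrete stand-in for the Poisson model). Returns the gaps
    between consecutive events, measured in time units.
    """
    rng = random.Random(seed)
    p = rate * dt
    gaps, last = [], None
    for i in range(n_slots):
        if rng.random() < p:
            if last is not None:
                gaps.append((i - last) * dt)
            last = i
    return gaps

rate = 2.0                       # hypothetical rate: events per unit time
gaps = simulate_gaps(rate)

y = 0.5
empirical = sum(g > y for g in gaps) / len(gaps)   # observed share of gaps > y
theoretical = math.exp(-rate * y)                  # P(Y > y) from the theorem
mean_gap = sum(gaps) / len(gaps)                   # should be near 1/rate

print(f"P(Y > {y}): simulated {empirical:.3f}, exponential model {theoretical:.3f}")
print(f"mean gap: simulated {mean_gap:.3f}, 1/rate = {1 / rate:.3f}")
```

If the exponential model is right, the simulated proportion of gaps exceeding y should sit close to $e^{-\lambda y}$, and the average gap should sit close to $1/\lambda$.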


Case Study 4.2.4 


Over “short” geological periods, a volcano’s eruptions are believed to be
Poisson events; that is, they are thought to occur independently and at a constant
rate. If so, the pdf describing the intervals between eruptions should
have the form $f_Y(y) = \lambda e^{-\lambda y}$. Collected for the purpose of testing that presumption
are the data in Table 4.2.5, showing the intervals (in months) that
elapsed between thirty-seven consecutive eruptions of Mauna Loa, a fourteen-thousand-foot
volcano in Hawaii (106). During the period covered (1832 to
1950), eruptions were occurring at the rate of $\lambda = 0.027$ per month (or once
every 3.1 years). Is the variability in these thirty-six $y_i$’s consistent with the
statement of Theorem 4.2.3?


Table 4.2.5 


To answer that question requires that the data be reduced to a density-scaled
histogram and superimposed on a graph of the predicted exponential pdf
(recall Case Study 3.4.1). Table 4.2.6 details the construction of the histogram.
Notice in Figure 4.2.4 that the shape of that histogram is entirely consistent with
the theoretical model, $f_Y(y) = 0.027e^{-0.027y}$, stated in Theorem 4.2.3.


Table 4.2.6

Interval (mos), y     Frequency     Density

  0 ≤ y <  20             13         0.0181
 20 ≤ y <  40              9         0.0125
 40 ≤ y <  60              5         0.0069
 60 ≤ y <  80              6         0.0083
 80 ≤ y < 100              2         0.0028
100 ≤ y < 120              0         0.0000
120 ≤ y < 140              1         0.0014
                          36


Figure 4.2.4 [Density-scaled histogram of the interval between eruptions (in months), with the curve $f_Y(y) = 0.027e^{-0.027y}$ superimposed.]
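
The density column of Table 4.2.6 is each frequency divided by (sample size × class width), so that the histogram has total area 1. A minimal sketch of that arithmetic follows, set beside the average height of the fitted exponential pdf over each class. The variable names are illustrative, and the comparison column is an added convenience rather than part of the original table.

```python
import math

rate = 0.027                      # eruptions per month (Case Study 4.2.4)
width = 20                        # class width in months
freqs = [13, 9, 5, 6, 2, 0, 1]    # frequencies from Table 4.2.6
n = sum(freqs)                    # 36 intervals

# Density-scaled bar heights: frequency / (n * class width)
densities = [f / (n * width) for f in freqs]

# Average height of the model pdf over each class:
# P(lo <= Y < hi) / width, with P(Y > y) = exp(-rate * y)
model_heights = [
    (math.exp(-rate * i * width) - math.exp(-rate * (i + 1) * width)) / width
    for i in range(len(freqs))
]

for i, (d, m) in enumerate(zip(densities, model_heights)):
    lo, hi = i * width, (i + 1) * width
    print(f"{lo:>3} to {hi:<3}  observed {d:.4f}  model {m:.4f}")
```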


About the Data Among pessimists, a favorite saying is “Bad things come in 
threes.” Optimists, not to be outdone, claim that “Good things come in threes.” Are 
they right? In a sense, yes, but not because of fate, bad karma, or good luck. Bad 
things (and good things and so-so things) seem to come in threes because of (1) our 
intuition’s inability to understand randomness and (2) the Poisson/exponential rela- 
tionship. Case Study 4.2.4—specifically, the shape of the exponential pdf pictured in 
Figure 4.2.4—illustrates the statistics behind the superstition. 

Random events, such as volcanic eruptions, do not occur at equally spaced 
intervals. Nor do the intervals between consecutive occurrences follow some sort 
of symmetric distribution, where the most common separations are close to the 
average separations. Quite the contrary. The Poisson/exponential relationship guar- 
antees that the distribution of interval lengths between consecutive occurrences will 
be sharply skewed [look again at fy(y)], implying that the most common separation 
lengths will be the shortest ones. 

Suppose that bad things are, in fact, happening to us randomly in time. Our intu- 
itions unconsciously get a sense of the rate at which those bad things are occurring. 
If they happen at the rate of, say, twelve bad things per year, we mistakenly think 
they should come one month apart. But that is simply not the way random events 
behave, as Theorem 4.2.3 clearly shows. 

Look at the entries in Table 4.2.5. The average of those thirty-six (randomly 
occurring) eruption separations was 37.7 months, yet seven of the separations were 
extremely short (less than or equal to six months). If two of those extremely short 
separations happened to occur consecutively, it would be tempting (but wrong) to 
conclude that the eruptions (since they came so close together) were “occurring in 
threes” for some supernatural reason. 

Using the combinatorial techniques discussed in Section 2.6, we can calculate 
the probability that two extremely short intervals would occur consecutively. Think 
of the thirty-six intervals as being either “normal” or “extremely short.” There are 
twenty-nine in the first group and seven in the second. Using the method described 
in Example 2.6.21, the probability that two extremely short separations would occur 
consecutively at least once is 61%, which hardly qualifies as a rare event: 


P(Two extremely short separations occur consecutively at least once)

$$= \frac{\binom{30}{6}\binom{6}{1} + \binom{30}{5}\binom{5}{2} + \binom{30}{4}\binom{4}{3}}{\binom{36}{7}} = 0.61$$
So, despite what our intuitions might tell us, the phenomenon of bad things coming 
in threes is neither mysterious nor uncommon or unexpected. 
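
The 61% figure can be reproduced with exact integer arithmetic. The sketch below rests on one reading of the counting argument, which is an assumption: the seven short separations are split into g groups of size one or two (g = 6, 5, or 4), the groups are placed in g of the 30 positions created by the twenty-nine normal separations, and 7 − g of the groups are chosen to be the pairs.

```python
from math import comb

n_normal, n_short = 29, 7
positions = n_normal + 1          # 30 places before, between, and after the normal ones

# g groups of short separations, each of size 1 or 2; (n_short - g) of them are pairs
favorable = sum(
    comb(positions, g) * comb(g, n_short - g)
    for g in (6, 5, 4)
)
total = comb(n_normal + n_short, n_short)   # all placements of 7 shorts among 36

prob = favorable / total
print(favorable, total, round(prob, 2))
```

Under this reading, arrangements containing three or more consecutive short separations are not counted, so the result describes pairs specifically rather than runs of any length.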


Example 4.2.5

Among the most famous of all meteor showers are the Perseids, which occur each
year in early August. In some areas the frequency of visible Perseids can be as high as 
forty per hour. Given that such sightings are Poisson events, calculate the probability 
that an observer who has just seen a meteor will have to wait at least five minutes 
before seeing another one. 

Let the random variable Y denote the interval (in minutes) between consecu- 
tive sightings. Expressed in the units of Y, the forty-per-hour rate of visible Perseids 
becomes 0.67 per minute. A straightforward integration, then, shows that the proba- 
bility is 0.035 that an observer will have to wait five minutes or more to see another 
meteor: 


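$$\begin{aligned}
P(Y > 5) &= \int_5^{\infty} 0.67e^{-0.67y}\,dy \\
&= \int_{3.35}^{\infty} e^{-u}\,du \quad (\text{where } u = 0.67y) \\
&= -e^{-u}\Big|_{3.35}^{\infty} = e^{-3.35} \\
&= 0.035
\end{aligned}$$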
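
Because $P(Y > t) = e^{-\lambda t}$ for an exponential waiting time, the integral above reduces to a single exponential. The short sketch below (the function name is illustrative) evaluates it with the rounded rate of 0.67 per minute used in the text, and also with the unrounded rate 40/60, to show how little the rounding matters.

```python
import math

def prob_wait_exceeds(rate, t):
    """P(Y > t) for an exponential waiting time Y with the given rate."""
    return math.exp(-rate * t)

p_rounded = prob_wait_exceeds(0.67, 5)       # rate rounded as in the example
p_unrounded = prob_wait_exceeds(40 / 60, 5)  # forty per hour, in per-minute units

print(round(p_rounded, 3), round(p_unrounded, 4))
```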


Questions

4.2.26. Suppose that commercial airplane crashes in a
certain country occur at the rate of 2.5 per year.

(a) Is it reasonable to assume that such crashes are
Poisson events? Explain.

(b) What is the probability that four or more crashes will
occur next year?

(c) What is the probability that the next two crashes will
occur within three months of one another?


4.2.27. Records show that deaths occur at the rate of 0.1
per day among patients residing in a large nursing home.
If someone dies today, what are the chances that a week
or more will elapse before another death occurs?


4.2.28. Suppose that $Y_1$ and $Y_2$ are independent exponential
random variables, each having pdf $f_Y(y) = \lambda e^{-\lambda y}$, y > 0.
If $Y = Y_1 + Y_2$, it can be shown that

$$f_{Y_1 + Y_2}(y) = \lambda^2 y e^{-\lambda y}, \quad y > 0$$


Recall Case Study 4.2.4. What is the probability that the 
next three eruptions of Mauna Loa will be less than forty 
months apart? 


4.2.29. Fifty spotlights have just been installed in an out- 
door security system. According to the manufacturer’s 
specifications, these particular lights are expected to burn
out at the rate of 1.1 per one hundred hours. What is the 
expected number of bulbs that will fail to last for at least 
seventy-five hours? 


4.2.30. Suppose you want to invent a new superstition 
that “Bad things come in fours.” Using the data given 
in Case Study 4.2.4 and the type of analysis described 
on p. 238, calculate the probability that your superstition 
would appear to be true. 


Figure 4.3.1 


Theorem 
4.3.1 


4.3 The Normal Distribution 


The Poisson limit described in Section 4.2 was not the only, or the first, approx- 
imation developed for the purpose of facilitating the calculation of binomial 
probabilities. Early in the eighteenth century, Abraham DeMoivre proved that
areas under the curve $f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$, $-\infty < z < \infty$, can be used to estimate
$P\left(a \le \frac{X - n\left(\frac{1}{2}\right)}{\sqrt{n\left(\frac{1}{2}\right)\left(\frac{1}{2}\right)}} \le b\right)$, where X is a binomial random variable with a large n and
$p = \frac{1}{2}$.

Figure 4.3.1 illustrates the central idea in DeMoivre’s discovery. Pictured is a
probability histogram of the binomial distribution with n = 20 and $p = \frac{1}{2}$. Superimposed
over the histogram is the function $f_Y(y) = \frac{1}{\sqrt{2\pi} \cdot \sqrt{5}} e^{-\frac{1}{2} \frac{(y - 10)^2}{5}}$. Notice how
closely the area under the curve approximates the area of the bar, even for this relatively
small value of n. The French mathematician Pierre-Simon Laplace generalized
DeMoivre’s original idea to binomial approximations for arbitrary p and brought
this theorem to the full attention of the mathematical community by including it in
his influential 1812 book, Théorie Analytique des Probabilités.
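
The agreement between bars and curve can be checked numerically. Each bar has width 1, so its area equals the binomial probability at that value. The sketch below compares that probability with the height of the normal curve having mean np = 10 and variance np(1 − p) = 5 at the same point. It is an informal check only, and the helper names are illustrative.

```python
from math import comb, exp, pi, sqrt

n, p = 20, 0.5
mean, var = n * p, n * p * (1 - p)      # 10 and 5

def binom_pmf(k):
    """Exact binomial probability P(X = k), the area of the bar at k."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(y):
    """Height at y of the normal curve with the same mean and variance."""
    return exp(-0.5 * (y - mean) ** 2 / var) / sqrt(2 * pi * var)

for k in range(5, 16):
    print(k, round(binom_pmf(k), 4), round(normal_pdf(k), 4))

largest_gap = max(abs(binom_pmf(k) - normal_pdf(k)) for k in range(n + 1))
print("largest difference:", round(largest_gap, 4))
```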


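Figure 4.3.1 [Probability histogram of the binomial distribution with n = 20 and $p = \frac{1}{2}$, with the curve $f_Y(y)$ superimposed; vertical axis: probability.]


Theorem 4.3.1

Let X be a binomial random variable defined on n independent trials for which p =
P(success). For any numbers a and b,

$$\lim_{n \to \infty} P\left(a \le \frac{X - np}{\sqrt{np(1 - p)}} \le b\right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-z^2/2}\,dz$$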




Proof One of the ways to verify Theorem 4.3.1 is to show that the limit of the
moment-generating function for $\frac{X - np}{\sqrt{np(1 - p)}}$ as $n \to \infty$ is $e^{t^2/2}$ and that $e^{t^2/2}$ is also
the value of $\int_{-\infty}^{\infty} e^{tz} \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz$. By Theorem 3.12.2, then, the limiting pdf of
$Z = \frac{X - np}{\sqrt{np(1 - p)}}$ is the function $f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$, $-\infty < z < \infty$. See Appendix 4.A.2
for the proof of a more general result.


Comment We saw in Section 4.2 that Poisson’s limit is actually a special case of
Poisson’s distribution, $p_X(k) = \frac{e^{-\lambda}\lambda^k}{k!}$, k = 0, 1, 2, .... Similarly, the DeMoivre-Laplace
limit is a pdf in its own right. Justifying that assertion, of course, requires proving
that $f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ integrates to 1 for $-\infty < z < \infty$.

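Curiously, there is no algebraic or trigonometric substitution that can be used to
demonstrate that the area under $f_Z(z)$ is 1. However, by using polar coordinates,
we can verify a necessary and sufficient alternative, namely, that the square of
$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz$ equals 1.

To begin, note that

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-x^2/2}\,dx \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy = \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x^2 + y^2)}\,dx\,dy$$

Let $x = r\cos\theta$ and $y = r\sin\theta$, so $dx\,dy = r\,dr\,d\theta$. Then

$$\begin{aligned}
\frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x^2 + y^2)}\,dx\,dy &= \frac{1}{2\pi} \int_0^{2\pi} \int_0^{\infty} e^{-r^2/2}\, r\,dr\,d\theta \\
&= \frac{1}{2\pi} \int_0^{\infty} r e^{-r^2/2}\,dr \cdot \int_0^{2\pi} d\theta \\
&= 1
\end{aligned}$$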


Comment The function $f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ is referred to as the standard normal (or
Gaussian) curve. By convention, any random variable whose probabilistic behavior
is described by a standard normal curve is denoted by Z (rather than X, Y, or W).
Since $M_Z(t) = e^{t^2/2}$, it follows readily that E(Z) = 0 and Var(Z) = 1.
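
These three facts (total area 1, mean 0, variance 1) are easy to confirm numerically. The sketch below is a sanity check rather than a proof: it applies the composite trapezoidal rule on [−10, 10], on the assumption that the area in the tails beyond ±10 is negligible. The step count and function names are arbitrary.

```python
from math import exp, pi, sqrt

def f_Z(z):
    """Standard normal pdf."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def integrate(g, a=-10.0, b=10.0, n=20_000):
    """Composite trapezoidal rule for the integral of g over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (g(a) + g(b))
    for i in range(1, n):
        total += g(a + i * h)
    return total * h

area = integrate(f_Z)                              # should be close to 1
mean = integrate(lambda z: z * f_Z(z))             # should be close to 0
variance = integrate(lambda z: z * z * f_Z(z))     # E(Z^2) - 0^2, close to 1

print(round(area, 6), round(mean, 6), round(variance, 6))
```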


Finding Areas Under the Standard Normal Curve 


In order to use Theorem 4.3.1, we need to be able to find the area under the graph of 
fz(z) above an arbitrary interval [a, b]. In practice, such values are obtained in one 
of two ways—either by using a normal table, a copy of which appears at the back 
of every statistics book, or by running a computer software package. Typically, both 
approaches give the cdf, $F_Z(z) = P(Z \le z)$, associated with Z (and from the cdf we
can deduce the desired area). 

Table 4.3.1 shows a portion of the normal table that appears in Appendix A.1. 
Each row under the Z heading represents a number along the horizontal axis of 
fz(z) rounded off to the nearest tenth; Columns 0 through 9 allow that number to be 
written to the hundredths place. Entries in the body of the table are areas under the 
graph of fz(z) to the left of the number indicated by the entry’s row and column. For 
example, the number listed at the intersection of the “1.1” row and the “4” column 
is 0.8729, which means that the area under $f_Z(z)$ from $-\infty$ to 1.14 is 0.8729. That is,

$$\int_{-\infty}^{1.14} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = 0.8729 = P(-\infty < Z \le 1.14) = F_Z(1.14)$$

Table 4.3.1

   Z       0       1       2       3       4       5       6       7       8       9

 -3.    0.0013  0.0010  0.0007  0.0005  0.0003  0.0002  0.0002  0.0001  0.0001  0.0000
   ⋮
 -0.4   0.3446  0.3409  0.3372  0.3336  0.3300  0.3264  0.3228  0.3192  0.3156  0.3121
 -0.3   0.3821  0.3783  0.3745  0.3707  0.3669  0.3632  0.3594  0.3557  0.3520  0.3483
 -0.2   0.4207  0.4168  0.4129  0.4090  0.4052  0.4013  0.3974  0.3936  0.3897  0.3859
 -0.1   0.4602  0.4562  0.4522  0.4483  0.4443  0.4404  0.4364  0.4325  0.4286  0.4247
 -0.0   0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641
  0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
  0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
  0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
  0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
  0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
  0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
  0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
  0.7   0.7580  0.7611  0.7642  0.7673  0.7703  0.7734  0.7764  0.7794  0.7823  0.7852
  0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
  0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
  1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
  1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
  1.2   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
  1.3   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
  1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9278  0.9292  0.9306  0.9319
   ⋮
  3.    0.9987  0.9990  0.9993  0.9995  0.9997  0.9998  0.9998  0.9999  0.9999  1.0000


(see Figure 4.3.2). 


Figure 4.3.2 [Graph of $f_Z(z)$ with the area to the left of z = 1.14 shaded.]


Areas under $f_Z(z)$ to the right of a number or between two numbers can also be
calculated from the information given in normal tables. Since the total area under
$f_Z(z)$ is 1,

$$\begin{aligned}
P(b < Z < +\infty) &= \text{area under } f_Z(z) \text{ to the right of } b \\
&= 1 - \text{area under } f_Z(z) \text{ to the left of } b \\
&= 1 - P(-\infty < Z \le b) \\
&= 1 - F_Z(b)
\end{aligned}$$
Figure 4.3.3


Example 


4.3.1 


Similarly, the area under $f_Z(z)$ between two numbers a and b is necessarily the
area under $f_Z(z)$ to the left of b minus the area under $f_Z(z)$ to the left of a:

$$\begin{aligned}
P(a \le Z \le b) &= \text{area under } f_Z(z) \text{ between } a \text{ and } b \\
&= \text{area under } f_Z(z) \text{ to the left of } b - \text{area under } f_Z(z) \text{ to the left of } a \\
&= P(-\infty < Z \le b) - P(-\infty < Z < a) \\
&= F_Z(b) - F_Z(a)
\end{aligned}$$

The Continuity Correction 


Figure 4.3.3 illustrates the underlying “geometry” implicit in the DeMoivre-Laplace
Theorem. Pictured there is a continuous curve, f(y), approximating a histogram,
where we can presume that the areas of the rectangles are representing the probabilities
associated with a discrete random variable X. Clearly, $\int_a^b f(y)\,dy$ is numerically
similar to $P(a \le X \le b)$, but the diagram suggests that the approximation would be
even better if the integral extended from a − 0.5 to b + 0.5, which would then include
the cross-hatched areas. That is, a refinement of the technique of using areas under
continuous curves to estimate probabilities of discrete random variables would be
to write

$$P(a \le X \le b) \doteq \int_{a - 0.5}^{b + 0.5} f(y)\,dy$$

The substitution of a − 0.5 for a and b + 0.5 for b is called the continuity correction.
Applying the latter to the DeMoivre-Laplace approximation leads to a slightly
different statement for Theorem 4.3.1: If X is a binomial random variable with
parameters n and p,

$$P(a \le X \le b) \doteq F_Z\left(\frac{b + 0.5 - np}{\sqrt{np(1 - p)}}\right) - F_Z\left(\frac{a - 0.5 - np}{\sqrt{np(1 - p)}}\right)$$


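[Figure 4.3.3: histogram bars approximated by the curve f(y); horizontal axis marked at a − 0.5, a, a + 1, a + 2, ..., b − 1, b, b + 0.5, with the areas between a − 0.5 and a and between b and b + 0.5 cross-hatched.]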


Comment Even with the continuity correction refinement, normal curve approxi- 
mations can be inadequate if n is too small, especially when p is close to 0 or to 1. As 
a rule of thumb, the DeMoivre-Laplace limit should be used only if the magnitudes
of n and p are such that $n > 9\,\frac{p}{1 - p}$ and $n > 9\,\frac{1 - p}{p}$.
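
To see what the continuity correction buys, the sketch below compares an exact binomial probability with the normal approximation computed with and without the ±0.5 adjustment. The values n = 20, p = 0.5 and the interval 8 to 12 are illustrative choices, the function names are hypothetical, and the rule of thumb is taken to be n > 9p/(1 − p) and n > 9(1 − p)/p.

```python
from math import comb, erf, sqrt

def F_Z(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def exact_binomial(a, b, n, p):
    """Exact P(a <= X <= b) for a binomial random variable X."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))

def normal_approx(a, b, n, p, corrected=True):
    """DeMoivre-Laplace approximation, optionally continuity-corrected."""
    mu, sd = n * p, sqrt(n * p * (1 - p))
    shift = 0.5 if corrected else 0.0
    return F_Z((b + shift - mu) / sd) - F_Z((a - shift - mu) / sd)

def rule_of_thumb_ok(n, p):
    """True when n exceeds both 9p/(1-p) and 9(1-p)/p."""
    return n > 9 * p / (1 - p) and n > 9 * (1 - p) / p

n, p = 20, 0.5
exact = exact_binomial(8, 12, n, p)
with_cc = normal_approx(8, 12, n, p)
without_cc = normal_approx(8, 12, n, p, corrected=False)

print(round(exact, 4), round(with_cc, 4), round(without_cc, 4))
print(rule_of_thumb_ok(20, 0.5), rule_of_thumb_ok(20, 0.95))
```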


Boeing 757s flying certain routes are configured to have 168 economy-class seats. 
Experience has shown that only 90% of all ticket holders on those flights will actu- 
ally show up in time to board the plane. Knowing that, suppose an airline sells 178 
tickets for the 168 seats. What is the probability that not everyone who arrives at the 
gate on time can be accommodated? 
Let the random variable X denote the number of would-be passengers who
show up for a flight. Since travelers are sometimes with their families, not every
ticket holder constitutes an independent event. Still, we can get a useful approximation
to the probability that the flight is overbooked by assuming that X is binomial
with n = 178 and p = 0.9. What we are looking for is $P(169 \le X \le 178)$, the probability
that more ticket holders show up than there are seats on the plane. According to
Theorem 4.3.1 (and using the continuity correction),

$$\begin{aligned}
P(\text{Flight is overbooked}) &= P(169 \le X \le 178) \\
&= P\left(\frac{168.5 - np}{\sqrt{np(1 - p)}} \le \frac{X - np}{\sqrt{np(1 - p)}} \le \frac{178.5 - np}{\sqrt{np(1 - p)}}\right) \\
&= P\left(\frac{168.5 - 178(0.9)}{\sqrt{178(0.9)(0.1)}} \le \frac{X - 178(0.9)}{\sqrt{178(0.9)(0.1)}} \le \frac{178.5 - 178(0.9)}{\sqrt{178(0.9)(0.1)}}\right) \\
&\doteq P(2.07 \le Z \le 4.57) = F_Z(4.57) - F_Z(2.07)
\end{aligned}$$


From Appendix A.1, $F_Z(4.57) = P(Z \le 4.57)$ is equal to 1, for all practical
purposes, and the area under $f_Z(z)$ to the left of 2.07 is 0.9808. Therefore,

$$\begin{aligned}
P(\text{Flight is overbooked}) &= 1.0000 - 0.9808 \\
&= 0.0192
\end{aligned}$$

implying that the chances are about one in fifty that not every ticket holder will have
a seat.
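
The overbooking calculation can be repeated in a few lines, and the exact binomial sum is included for comparison. This is a sketch under the same binomial assumption as the example, and the variable names are illustrative. Because the z values are not rounded to two decimals here, the approximation differs slightly from the table-based 0.0192.

```python
from math import comb, erf, sqrt

def F_Z(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p, seats = 178, 0.9, 168
mu, sd = n * p, sqrt(n * p * (1 - p))

# Continuity-corrected limits for P(169 <= X <= 178)
z_lo = (seats + 0.5 - mu) / sd
z_hi = (n + 0.5 - mu) / sd
approx = F_Z(z_hi) - F_Z(z_lo)

# Exact binomial probability that more than 168 ticket holders show up
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(seats + 1, n + 1))

print(round(z_lo, 2), round(z_hi, 2))
print(round(approx, 4), round(exact, 4))
```

The exact value is somewhat smaller than the normal approximation, which is unsurprising for a p this close to 1.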


Case Study 4.3.1 


Research in extrasensory perception has ranged from the slightly unconven- 
tional to the downright bizarre. Toward the latter part of the nineteenth century 
and even well into the twentieth century, much of what was done involved spir- 
itualists and mediums. But beginning around 1910, experimenters moved out 
of the seance parlors and into the laboratory, where they began setting up con- 
trolled studies that could be analyzed statistically. In 1938, Pratt and Woodruff, 
working out of Duke University, did an experiment that became a prototype for 
an entire generation of ESP research (71). 

The investigator and a subject sat at opposite ends of a table. Between them 
was a screen with a large gap at the bottom. Five blank cards, visible to both 
participants, were placed side by side on the table beneath the screen. On the 
subject’s side of the screen one of the standard ESP symbols (see Figure 4.3.4) 
was hung over each of the blank cards. 


Figure 4.3.4 [The standard ESP symbols.]



The experimenter shuffled a deck of ESP cards, picked up the top one, and 
concentrated on it. The subject tried to guess its identity: If he thought it was a 
circle, he would point to the blank card on the table that was beneath the circle 
card hanging on his side of the screen. The procedure was then repeated. Alto- 
gether, a total of thirty-two subjects, all students, took part in the experiment. 
They made a total of sixty thousand guesses—and were correct 12,489 times. 

With five denominations involved, the probability of a subject’s making a
correct identification just by chance was $\frac{1}{5}$. Assuming a binomial model, the
expected number of correct guesses would be 60,000 × $\frac{1}{5}$, or 12,000. The question
is, how “near” to 12,000 is 12,489? Should we write off the observed excess
of 489 as nothing more than luck, or can we conclude that ESP has been
demonstrated?

To effect a resolution between the conflicting “luck” and “ESP” hypotheses,
we need to compute the probability of the subjects’ getting 12,489 or more
correct answers under the presumption that $p = \frac{1}{5}$. Only if that probability is very
small can 12,489 be construed as evidence in support of ESP.

Let the random variable X denote the number of correct responses in sixty
thousand tries. Then

$$P(X \ge 12{,}489) = \sum_{k = 12{,}489}^{60{,}000} \binom{60{,}000}{k} \left(\frac{1}{5}\right)^k \left(\frac{4}{5}\right)^{60{,}000 - k} \tag{4.3.1}$$

At this point the DeMoivre-Laplace limit theorem becomes a welcome alternative
to computing the 47,512 binomial probabilities implicit in Equation 4.3.1.
First we apply the continuity correction and rewrite $P(X \ge 12{,}489)$ as
$P(X \ge 12{,}488.5)$. Then

$$\begin{aligned}
P(X \ge 12{,}489) &= P\left(\frac{X - np}{\sqrt{np(1 - p)}} \ge \frac{12{,}488.5 - 60{,}000(1/5)}{\sqrt{60{,}000(1/5)(4/5)}}\right) \\
&= P\left(\frac{X - np}{\sqrt{np(1 - p)}} \ge 4.99\right) \\
&\doteq \frac{1}{\sqrt{2\pi}} \int_{4.99}^{\infty} e^{-z^2/2}\,dz \\
&= 0.0000003
\end{aligned}$$

this last value being obtained from a more extensive version of Table A.1 in the 
Appendix. 
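
Where an extended table is not at hand, the same tail area can be computed directly. The sketch below uses the complementary error function from Python's `math` module, via $P(Z \ge z) = \frac{1}{2}\operatorname{erfc}(z/\sqrt{2})$, which stays accurate for small tail areas. The variable names are illustrative.

```python
from math import erfc, sqrt

n, p, observed = 60_000, 1 / 5, 12_489
mu, sd = n * p, sqrt(n * p * (1 - p))     # 12,000 and about 97.98

# Continuity-corrected z value for P(X >= 12,489)
z = (observed - 0.5 - mu) / sd

# Upper-tail area of the standard normal curve beyond z
tail = 0.5 * erfc(z / sqrt(2))

print(round(z, 2), f"{tail:.1e}")
```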

Here, the fact that $P(X \ge 12{,}489)$ is so extremely small makes the “luck”
hypothesis ($p = \frac{1}{5}$) untenable. It would appear that something other than chance
had to be responsible for the occurrence of so many correct guesses. Still, it does 
not follow that ESP has necessarily been demonstrated. Flaws in the experi- 
mental setup as well as errors in reporting the scores could have inadvertently 
produced what appears to be a statistically significant result. Suffice it to say that 
a great many scientists remain highly skeptical of ESP research in general and 
of the Pratt-Woodruff experiment in particular. [For a more thorough critique 
of the data we have just described, see (43).] 
About the Data This is a good set of data for illustrating why we need formal mathematical
methods for interpreting data. As we have seen on other occasions, our
intuitions, when left unsupported by probability calculations, can often be deceived. 
A typical first reaction to the Pratt-Woodruff results is to dismiss as inconsequen- 
tial the 489 additional correct answers. To many, it seems entirely believable that 
sixty thousand guesses could produce, by chance, an extra 489 correct responses. 
Only after making the $P(X \ge 12{,}489)$ computation do we see the utter implausibility
of that conclusion. What statistics is doing here is what we would like it to do in 
general—rule out hypotheses that are not supported by the data and point us in the 
direction of inferences that are more likely to be true. 


Questions 


4.3.1. Use Appendix Table A.1 to evaluate the following 
integrals. In each case, draw a diagram of f7(z) and shade 
the area that corresponds to the integral. 


(a) fon Agee? dz 
(b) [oe Fee? dz 
(c) Heer meee dz 
d) [iY see dz 


4.3.2. Let Z be a standard normal random variable. Use 
Appendix Table A.1 to find the numerical value for each 
of the following probabilities. Show each of your answers 
as an area under f7(z). 


(a) P(0 < Z < 2.07)

(b) P(−0.64 < Z < −0.11)

(c) P(Z > −1.06)

(d) P(Z < −2.33)

(e) P(Z > 4.61)


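4.3.3.
(a) Let 0 < a < b. Which number is larger?

$$\int_a^b \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz \quad \text{or} \quad \int_{-b}^{-a} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz$$

(b) Let a > 0. Which number is larger?

$$\int_a^{a + 1} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz \quad \text{or} \quad \int_{a - 1/2}^{a + 1/2} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz$$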


4.3.4. 


(a) Evaluate i edz, 
(b) Evaluate [® 6e7*/? dz. 


4.3.5. Assume that the random variable Z is described by 
a standard normal curve fz(z). For what values of z are 
the following statements true? 


(a) P(Z <z)=0.33 

(b) P(Z>z)=0.2236 

(c) P(−1.00 < Z < z) = 0.5004
(d) P(−z < Z < z) = 0.80

(e) P(z<Z<2.03)=0.15 


4.3.6. Let $z_\alpha$ denote the value of Z for which
$P(Z \ge z_\alpha) = \alpha$. By definition, the interquartile range, Q, for the
standard normal curve is the difference

$$Q = z_{.25} - z_{.75}$$


Find Q. 


4.3.7. Oak Hill has 74,806 registered automobiles. A city 
ordinance requires each to display a bumper decal show- 
ing that the owner paid an annual wheel tax of $50. By 
law, new decals need to be purchased during the month of 
the owner’s birthday. This year’s budget assumes that at 
least $306,000 in decal revenue will be collected in Novem- 
ber. What is the probability that the wheel taxes reported 
in that month will be less than anticipated and produce a 
budget shortfall? 


4.3.8. Hertz Brothers, a small, family-owned radio man- 
ufacturer, produces electronic components domestically 
but subcontracts the cabinets to a foreign supplier. 
Although inexpensive, the foreign supplier has a quality- 
control program that leaves much to be desired. On the 
average, only 80% of the standard 1600-unit shipment that 
Hertz receives is usable. Currently, Hertz has back orders 
for 1260 radios but storage space for no more than 1310 
cabinets. What are the chances that the number of usable 
units in Hertz’s latest shipment will be large enough to 
allow Hertz to fill all the orders already on hand, yet small 
enough to avoid causing any inventory problems? 
4.3.9. Fifty-five percent of the registered voters in
Sheridanville favor their incumbent mayor in her bid 
for re-election. If four hundred voters go to the polls, 
approximate the probability that 


(a) the race ends in a tie. 
(b) the challenger scores an upset victory. 


4.3.10. State Tech’s basketball team, the Fighting Loga- 
rithms, have a 70% foul-shooting percentage. 


(a) Write a formula for the exact probability that out of 
their next one hundred free throws, they will make 
between seventy-five and eighty, inclusive. 

(b) Approximate the probability asked for in part (a). 


4.3.11. A random sample of 747 obituaries published 
recently in Salt Lake City newspapers revealed that 
344 (or 46%) of the decedents died in the three-month 
period following their birthdays (123). Assess the sta- 
tistical significance of that finding by approximating the 
probability that 46% or more would die in that particu- 
lar interval if deaths occurred randomly throughout the 
year. What would you conclude on the basis of your 
answer? 


4.3.12. There is a theory embraced by certain parapsychologists
that hypnosis can enhance a person’s ESP ability.
To test that hypothesis, an experiment was set up with
fifteen hypnotized subjects (21). Each was asked to make
100 guesses using the same sort of ESP cards and protocol
that were described in Case Study 4.3.1. A total of 326
correct identifications were made. Can it be argued on the
basis of those results that hypnosis does have an effect on
a person’s ESP ability? Explain.


4.3.13. If $p_X(k) = \binom{10}{k}(0.7)^k(0.3)^{10 - k}$, k = 0, 1, ..., 10, is it
appropriate to approximate $P(4 \le X \le 8)$ by computing
the following?

$$P\left(\frac{3.5 - 10(0.7)}{\sqrt{10(0.7)(0.3)}} \le Z \le \frac{8.5 - 10(0.7)}{\sqrt{10(0.7)(0.3)}}\right)$$

Explain.


4.3.14. A sell-out crowd of 42,200 is expected at Cleveland’s
Jacobs Field for next Tuesday’s game against the
Baltimore Orioles, the last before a long road trip. The
ballpark’s concession manager is trying to decide how
much food to have on hand. Looking at records from
games played earlier in the season, she knows that, on the
average, 38% of all those in attendance will buy a hot dog.
How large an order should she place if she wants to have
no more than a 20% chance of demand exceeding supply?


Central Limit Theorem


It was pointed out in Example 3.9.3 that every binomial random variable X can
be written as the sum of n independent Bernoulli random variables $X_1, X_2, \ldots, X_n$,
where

$$X_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$

But if $X = X_1 + X_2 + \cdots + X_n$, Theorem 4.3.1 can be reexpressed as

$$\lim_{n \to \infty} P\left(a \le \frac{X_1 + X_2 + \cdots + X_n - np}{\sqrt{np(1 - p)}} \le b\right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-z^2/2}\,dz \tag{4.3.2}$$


Implicit in Equation 4.3.2 is an obvious question: Does the DeMoivre-Laplace
limit apply to sums of other types of random variables as well? Remarkably, the
answer is “yes.” Efforts to extend Equation 4.3.2 have continued for more than 150
years. Russian probabilists (A. M. Lyapunov, in particular) made many of the key
advances. In 1920, George Polya gave these new generalizations a name that has
been associated with the result ever since: He called it the central limit theorem (136).


Theorem 4.3.2

(Central Limit Theorem) Let $W_1, W_2, \ldots$ be an infinite sequence of independent random
variables, each with the same distribution. Suppose that the mean $\mu$ and the
variance $\sigma^2$ of $f_W(w)$ are both finite. For any numbers a and b,

$$\lim_{n \to \infty} P\left(a \le \frac{W_1 + \cdots + W_n - n\mu}{\sqrt{n}\,\sigma} \le b\right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-z^2/2}\,dz$$

Proof See Appendix 4.A.2.


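Comment The central limit theorem is often stated in terms of the average of $W_1$,
$W_2$, ..., and $W_n$, rather than their sum. Since

$$E\left[\frac{1}{n}(W_1 + \cdots + W_n)\right] = E(\bar{W}) = \mu \quad \text{and} \quad \mathrm{Var}\left[\frac{1}{n}(W_1 + \cdots + W_n)\right] = \sigma^2/n,$$

Theorem 4.3.2 can be stated in the equivalent form

$$\lim_{n \to \infty} P\left(a \le \frac{\bar{W} - \mu}{\sigma/\sqrt{n}} \le b\right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-z^2/2}\,dz$$

We will use both formulations, the choice depending on which is more convenient
for the problem at hand.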


Example 4.3.2

The top of Table 4.3.2 shows a Minitab simulation where forty random samples of
size 5 were drawn from a uniform pdf defined over the interval [0, 1]. Each row
corresponds to a different sample. The sum of the five numbers appearing in a given
sample is denoted “y” and is listed in column C6. For this particular uniform pdf,
$\mu = \frac{1}{2}$ and $\sigma^2 = \frac{1}{12}$ (recall Question 3.6.4), so

$$\frac{W_1 + \cdots + W_n - n\mu}{\sqrt{n}\,\sigma} = \frac{y - \frac{5}{2}}{\sqrt{\frac{5}{12}}}$$

Table 4.3.2

Row        C1        C2        C3        C4        C5       C6        C7
           y1        y2        y3        y4        y5        y   Z ratio

  1  0.556099  0.646873  0.354373  0.673821  0.233126  2.46429  -0.05532
  2  0.497846  0.588979  0.272095  0.956614  0.819901  3.13544   0.98441
  3  0.284027  0.209458  0.414743  0.614309  0.439456  1.96199  -0.83348
  4  0.599286  0.667891  0.194460  0.839481  0.694474  2.99559   0.76777
  5  0.280689  0.692159  0.036593  0.728826  0.314434  2.05270  -0.69295
  6  0.462741  0.349264  0.471254  0.613070  0.489125  2.38545  -0.17745
  7  0.556940  0.246789  0.719907  0.711414  0.918221  3.15327   1.01204
  8  0.102855  0.679119  0.559210  0.014393  0.518450  1.87403  -0.96975
  9  0.642859  0.004636  0.728131  0.299165  0.801093  2.47588  -0.03736
 10  0.017770  0.568188  0.416351  0.908079  0.075108  1.98550  -0.79707
 11  0.331291  0.410705  0.118571  0.979254  0.242582  2.08240  -0.64694
 12  0.355047  0.961126  0.920597  0.575467  0.585492  3.39773   1.39076
 13  0.626197  0.304754  0.530345  0.933018  0.675899  3.07021   0.88337
 14  0.211714  0.404505  0.045544  0.213012  0.520614  1.39539  -1.71125
 15  0.535199  0.130715  0.603642  0.333023  0.405782  2.00836  -0.76164
 16  0.810374  0.153955  0.082226  0.827269  0.897901  2.77172   0.42095
 17  0.687550  0.185393  0.620878  0.013395  0.819712  2.32693  -0.26812
 18  0.424193  0.529199  0.201554  0.157073  0.090455  1.40248  -1.70028
 19  0.397373  0.143507  0.973991  0.234845  0.681147  2.43086  -0.10711
 20  0.413788  0.653468  0.017335  0.556255  0.900568  2.54141   0.06416
 21  0.602607  0.094162  0.247676  0.638875  0.653910  2.23723  -0.40708
 22  0.963678  0.375850  0.909377  0.307358  0.828882  3.38515   1.37126
 23  0.967499  0.868809  0.940770  0.405564  0.814348  3.99699   2.31913
 24  0.439913  0.446679  0.075227  0.983295  0.554581  2.49970  -0.00047
 25  0.215774  0.407494  0.002307  0.971140  0.437144  2.03386  -0.72214
 26  0.108881  0.271860  0.972351  0.604762  0.210347  2.16820  -0.51402
 27  0.337798  0.173911  0.309916  0.300208  0.666831  1.78866  -1.10200
 28  0.635017  0.187311  0.365419  0.831417  0.463567  2.48273  -0.02675
 29  0.563097  0.065293  0.841320  0.518055  0.685137  2.67290   0.26786
 30  0.687242  0.544286  0.980337  0.649507  0.077364  2.93874   0.67969
 31  0.784501  0.745614  0.459559  0.565875  0.529171  3.08472   0.90584
 32  0.505460  0.355340  0.163285  0.352540  0.896521  2.27315  -0.35144
 33  0.336992  0.734869  0.824409  0.321047  0.682283  2.89960   0.61906
 34  0.784279  0.194038  0.323756  0.430020  0.459238  2.19133  -0.47819
 35  0.548008  0.788351  0.831117  0.200790  0.823102  3.19137   1.07106
 36  0.096383  0.844281  0.680927  0.656946  0.050867  2.32940  -0.26429
 37  0.161502  0.972933  0.038113  0.515530  0.553788  2.24187  -0.39990
 38  0.677552  0.232181  0.307234  0.588927  0.365403  2.17130  -0.50922
 39  0.470454  0.267230  0.652802  0.633286  0.410964  2.43474  -0.10111
 40  0.104377  0.819950  0.047036  0.189226  0.399502  1.56009  -1.45610


[Density-scaled histogram of the forty Z ratios. Horizontal axis: Z ratio, from -3.5
to 3.5; vertical axis: Density.]


At the bottom of Table 4.3.2 is a density-scaled histogram of the forty “Z ratios,”
$\frac{y - 5/2}{\sqrt{5/12}}$ (as listed in column C7). Notice the close agreement between
the distribution of those ratios and $f_Z(z)$: What we see there is entirely consistent
with the statement of Theorem 4.3.2.
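
The simulation just described can be imitated in a few lines of Python. What follows is only a sketch: it uses Python's own random number generator rather than Minitab, so the forty sums will not match Table 4.3.2, and the names `z_ratio` and `simulate_z_ratios` as well as the seed are illustrative choices, not part of the text.

```python
import math
import random

MU = 0.5            # mean of the uniform pdf on [0, 1]
SIGMA2 = 1.0 / 12   # variance of the uniform pdf on [0, 1]

def z_ratio(y, n=5):
    """Standardize a sum y of n uniform observations as in Theorem 4.3.2."""
    return (y - n * MU) / math.sqrt(n * SIGMA2)

def simulate_z_ratios(num_samples=40, n=5, seed=1):
    """Draw num_samples samples of size n and return the Z ratio of each sum."""
    rng = random.Random(seed)   # arbitrary seed, used only to make the run repeatable
    return [z_ratio(sum(rng.random() for _ in range(n)), n)
            for _ in range(num_samples)]

ratios = simulate_z_ratios()
within_one = sum(abs(z) <= 1.0 for z in ratios) / len(ratios)
print(f"Z ratio for the first sum in Table 4.3.2 (y = 2.46429): {z_ratio(2.46429):.5f}")
print(f"Fraction of simulated Z ratios in [-1, 1]: {within_one:.2f}")
```

If the normal approximation is good, roughly two-thirds of the simulated ratios should fall between -1 and 1, although with only forty samples the observed fraction can vary noticeably from run to run.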


Comment Theorem 4.3.2 is an asymptotic result, yet it can provide surprisingly
good approximations even when n is very small. Example 4.3.2 is a typical case in
point: The uniform pdf over [0, 1] looks nothing like a bell-shaped curve, yet random
samples as small as n = 5 yield sums that behave probabilistically much like the
theoretical limit.

In general, samples from symmetric pdfs will produce sums that “converge”
quickly to the theoretical limit. On the other hand, if the underlying pdf is sharply
skewed (for example, $f_Y(y) = 10e^{-10y}$, $y > 0$), it would take a larger n to achieve the
level of agreement present in Figure 4.3.2.


Example 4.3.3

A random sample of size n = 15 is drawn from the pdf $f_Y(y) = 3(1 - y)^2$, $0 \le y \le 1$.
Let $\bar{Y} = \frac{1}{15}\sum_{i=1}^{15} Y_i$. Use the central limit theorem to approximate
$P\left(\frac{1}{8} \le \bar{Y} \le \frac{3}{8}\right)$.

Note, first of all, that

\[
E(Y) = \int_0^1 y \cdot 3(1 - y)^2\,dy = \frac{1}{4}
\]

and

\[
\sigma^2 = \mathrm{Var}(Y) = E(Y^2) - \mu^2 = \int_0^1 y^2 \cdot 3(1 - y)^2\,dy - \left(\frac{1}{4}\right)^2 = \frac{3}{80}
\]

According, then, to the central limit theorem formulation that appears in the
comment on p. 247, the probability that $\bar{Y}$ will lie between $\frac{1}{8}$ and $\frac{3}{8}$ is
approximately 0.99:

\[
P\left(\frac{1}{8} \le \bar{Y} \le \frac{3}{8}\right)
= P\left(\frac{\frac{1}{8} - \frac{1}{4}}{\sqrt{\frac{3}{80}}\big/\sqrt{15}} \le \frac{\bar{Y} - \frac{1}{4}}{\sqrt{\frac{3}{80}}\big/\sqrt{15}} \le \frac{\frac{3}{8} - \frac{1}{4}}{\sqrt{\frac{3}{80}}\big/\sqrt{15}}\right)
= P(-2.50 \le Z \le 2.50) = 0.9876
\]


Example 4.3.4

In preparing next quarter’s budget, the accountant for a small business has one hun- 
dred different expenditures to account for. Her predecessor listed each entry to the 
penny, but doing so grossly overstates the precision of the process. As a more truth- 
ful alternative, she intends to record each budget allocation to the nearest $100. 
What is the probability that her total estimated budget will end up differing from 
the actual cost by more than $500? Assume that Y;, Y2, ..., Yio0, the rounding errors 
she makes on the one hundred items, are independent and uniformly distributed 
over the interval [—$50, +$50]. 
Let

\[
S_{100} = Y_1 + Y_2 + \cdots + Y_{100} = \text{total rounding error}
\]

What the accountant wants to estimate is $P(|S_{100}| > \$500)$. By the distribution
assumption made for each $Y_i$,

\[
E(Y_i) = 0, \quad i = 1, 2, \ldots, 100
\]

and

\[
\mathrm{Var}(Y_i) = E(Y_i^2) = \int_{-50}^{50} \frac{1}{100}\,y^2\,dy = \frac{2500}{3}
\]

Therefore,

\[
E(S_{100}) = E(Y_1 + Y_2 + \cdots + Y_{100}) = 0
\]

and

\[
\mathrm{Var}(S_{100}) = \mathrm{Var}(Y_1 + Y_2 + \cdots + Y_{100}) = 100\left(\frac{2500}{3}\right) = \frac{250{,}000}{3}
\]

Applying Theorem 4.3.2, then, shows that her strategy has roughly an 8% chance of
being in error by more than $500:

\[
P(|S_{100}| > \$500) = 1 - P(-500 \le S_{100} \le 500)
= 1 - P\left(\frac{-500 - 0}{500/\sqrt{3}} \le \frac{S_{100} - 0}{500/\sqrt{3}} \le \frac{500 - 0}{500/\sqrt{3}}\right)
= 1 - P(-1.73 \le Z \le 1.73) = 0.0836
\]
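
As a check on the arithmetic in this example, the following Python sketch recomputes the approximation with the standard library's `statistics.NormalDist`. The variable names are illustrative. Because the code does not round the Z value to 1.73 before looking up the area, its answer differs from 0.0836 in the fourth decimal place.

```python
import math
from statistics import NormalDist

n = 100                 # number of budget items
a, b = -50.0, 50.0      # each rounding error is assumed uniform on [a, b]

var_each = (b - a) ** 2 / 12.0        # variance of a uniform pdf, here 2500/3
sd_total = math.sqrt(n * var_each)    # standard deviation of S_100, i.e. 500/sqrt(3)

z = 500.0 / sd_total                  # standardized cutoff, about 1.73
p_exceed = 2.0 * (1.0 - NormalDist().cdf(z))   # P(|S_100| > 500) by symmetry

print(f"z = {z:.4f}, P(|S_100| > 500) is approximately {p_exceed:.4f}")
```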


Questions 


4.3.15. A fair coin is tossed two hundred times. Let $X_i = 1$
if the ith toss comes up heads and $X_i = 0$ otherwise,
$i = 1, 2, \ldots, 200$; $X = \sum_{i=1}^{200} X_i$. Calculate the central limit
theorem approximation for $P(|X - E(X)| \le 5)$. How does this
differ from the DeMoivre-Laplace approximation?


4.3.16. Suppose that one hundred fair dice are tossed. 
Estimate the probability that the sum of the faces show- 
ing exceeds 370. Include a continuity correction in your 
analysis. 


4.3.17. Let X be the amount won or lost in betting $5
on red in roulette. Then $p_X(5) = \frac{18}{38}$ and $p_X(-5) = \frac{20}{38}$. If
a gambler bets on red one hundred times, use the central
limit theorem to estimate the probability that those
wagers result in less than $50 in losses.


4.3.18. If $X_1, X_2, \ldots, X_n$ are independent Poisson
random variables with parameters $\lambda_1, \lambda_2, \ldots, \lambda_n$, respectively,
and if $X = X_1 + X_2 + \cdots + X_n$, then X is a
Poisson random variable with parameter $\lambda = \sum_{i=1}^{n} \lambda_i$ (recall
Example 3.12.10). What specific form does the ratio
in Theorem 4.3.2 take if the $X_i$’s are Poisson random
variables?


4.3.19. An electronics firm receives, on the average, fifty 
orders per week for a particular silicon chip. If the com- 
pany has sixty chips on hand, use the central limit theorem 
to approximate the probability that they will be unable to 
fill all their orders for the upcoming week. Assume that 
weekly demands follow a Poisson distribution. (Hint: See 
Question 4.3.18.) 


4.3.20. Considerable controversy has arisen over the pos- 
sible aftereffects of a nuclear weapons test conducted 
in Nevada in 1957. Included as part of the test were 
some three thousand military and civilian “observers.” 
Now, more than fifty years later, eight cases of leukemia 
have been diagnosed among those three thousand. The 
expected number of cases, based on the demographic 
characteristics of the observers, was three. Assess the sta- 
tistical significance of those findings. Calculate both an 
exact answer using the Poisson distribution as well as an 
approximation based on the central limit theorem. 
The Normal Curve as a Model for Individual Measurements


Because of the central limit theorem, we know that sums (or averages) of virtually 
any set of random variables, when suitably scaled, have distributions that can be 
approximated by a standard normal curve. Perhaps even more surprising is the fact 
that many individual measurements, when suitably scaled, also have a standard nor- 
mal distribution. Why should the latter be true? What do single observations have 
in common with samples of size n? 

Astronomers in the early nineteenth century were among the first to understand
the connection. Imagine looking through a telescope for the purpose of determining
the location of a star. Conceptually, the data point, Y, eventually recorded is the
sum of two components: (1) the star’s true location $\mu^*$ (which remains unknown)
and (2) measurement error. By definition, measurement error is the net effect of all
those factors that cause the random variable Y to have a value different from $\mu^*$.
Typically, these effects will be additive, in which case the random variable can be
written as a sum,

\[
Y = \mu^* + W_1 + W_2 + \cdots + W_t \tag{4.3.3}
\]

where $W_1$, for example, might represent the effect of atmospheric irregularities, $W_2$
the effect of seismic vibrations, $W_3$ the effect of parallax distortions, and so on.

If Equation 4.3.3 is a valid representation of the random variable Y, then it
would follow that the central limit theorem applies to the individual $Y_i$’s. Moreover,
if

\[
E(Y) = E(\mu^* + W_1 + W_2 + \cdots + W_t) = \mu
\]

and

\[
\mathrm{Var}(Y) = \mathrm{Var}(\mu^* + W_1 + W_2 + \cdots + W_t) = \sigma^2
\]

the ratio in Theorem 4.3.2 takes the form $\frac{Y - \mu}{\sigma}$. Furthermore, t is likely to be very
large, so the approximation implied by the central limit theorem is essentially an
equality; that is, we take the pdf of $\frac{Y - \mu}{\sigma}$ to be $f_Z(z)$.

Finding an actual formula for $f_Y(y)$, then, becomes an exercise in applying
Theorem 3.8.2. Given that $\frac{Y - \mu}{\sigma} = Z$,

\[
Y = \mu + \sigma Z
\]

and

\[
f_Y(y) = \frac{1}{\sigma}\,f_Z\!\left(\frac{y - \mu}{\sigma}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2}, \quad -\infty < y < \infty
\]

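Definition 4.3.1. A random variable Y is said to be normally distributed with
mean $\mu$ and variance $\sigma^2$ if

\[
f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2}, \quad -\infty < y < \infty
\]

The symbol $Y \sim N(\mu, \sigma^2)$ will sometimes be used to denote the fact that Y has
a normal distribution with mean $\mu$ and variance $\sigma^2$.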
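
The normal pdf in Definition 4.3.1 is easy to evaluate numerically. The sketch below codes the formula directly and uses a simple midpoint rule to confirm that the curve encloses an area of 1 and has mean $\mu$. The function names and the values $\mu = 100$, $\sigma = 16$ are illustrative assumptions chosen only for the demonstration.

```python
import math

def normal_pdf(y, mu, sigma):
    """Normal pdf with mean mu and standard deviation sigma (sigma > 0)."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

mu, sigma = 100.0, 16.0                      # illustrative values only
lo, hi = mu - 10 * sigma, mu + 10 * sigma    # the tails beyond 10 sigma are negligible

area = integrate(lambda y: normal_pdf(y, mu, sigma), lo, hi)
mean = integrate(lambda y: y * normal_pdf(y, mu, sigma), lo, hi)
print(f"area under the curve is about {area:.6f}; mean is about {mean:.4f}")
```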
Comment Areas under an “arbitrary” normal distribution, $f_Y(y)$, are calculated by
finding the equivalent area under the standard normal distribution, $f_Z(z)$:

\[
P(a \le Y \le b) = P\left(\frac{a - \mu}{\sigma} \le \frac{Y - \mu}{\sigma} \le \frac{b - \mu}{\sigma}\right) = P\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right)
\]

The ratio $\frac{Y - \mu}{\sigma}$ is often referred to as either a Z transformation or a Z score.
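
In code, the Z transformation amounts to standardizing both endpoints and differencing the standard normal cdf. The following sketch assumes Python 3.8 or later for `statistics.NormalDist`; the function name `interval_prob` and the values $\mu = 10$, $\sigma = 2$ are illustrative, not taken from the text.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # mean 0, standard deviation 1

def interval_prob(a, b, mu, sigma):
    """P(a <= Y <= b) for Y ~ N(mu, sigma^2), computed via the Z transformation."""
    z_low = (a - mu) / sigma
    z_high = (b - mu) / sigma
    return STANDARD_NORMAL.cdf(z_high) - STANDARD_NORMAL.cdf(z_low)

# Illustrative check: the area within one standard deviation of the mean.
mu, sigma = 10.0, 2.0
via_z = interval_prob(mu - sigma, mu + sigma, mu, sigma)

# The same area computed directly from the N(mu, sigma^2) cdf, for comparison.
direct = NormalDist(mu, sigma).cdf(mu + sigma) - NormalDist(mu, sigma).cdf(mu - sigma)
print(f"via Z transformation: {via_z:.4f}; direct: {direct:.4f}")
```

Both routes give the same area, which is the point of the comment: any normal probability reduces to a lookup in the standard normal table.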


Example 4.3.5

In most states a motorist is legally drunk, or driving under the influence (DUI), if his
or her blood alcohol concentration, Y, is 0.08% or higher. When a suspected DUI
offender is pulled over, police often request a sobriety test. Although the breath
analyzers used for that purpose are remarkably precise, the machines do exhibit
a certain amount of measurement error. Because of that variability, the possibility
exists that a driver’s true blood alcohol concentration may be under 0.08% even if
the analyzer gives a reading over 0.08%.

Experience has shown that repeated breath analyzer measurements taken from
the same person produce a distribution of responses that can be described by a normal
pdf with $\mu$ equal to the person’s true blood alcohol concentration and $\sigma$ equal
to 0.004%. Suppose a driver is stopped at a roadblock on his way home from a party.
Having celebrated a bit more than he should have, he has a true blood alcohol
concentration of 0.075%, just barely under the legal limit. If he takes the breath analyzer
test, what are the chances that he will be incorrectly booked on a DUI charge?
Figure 4.3.5


Since a DUI arrest occurs when $Y \ge 0.08\%$, we need to find $P(Y \ge 0.08)$ when
$\mu = 0.075$ and $\sigma = 0.004$ (the percentage is irrelevant to any probability calculation
and can be ignored). An application of the Z transformation shows that the driver
has almost an 11% chance of being falsely accused:

\[
P(Y \ge 0.08) = P\left(\frac{Y - 0.075}{0.004} \ge \frac{0.080 - 0.075}{0.004}\right)
= P(Z \ge 1.25) = 1 - P(Z < 1.25)
= 1 - 0.8944 = 0.1056
\]

Figure 4.3.5 shows $f_Y(y)$, $f_Z(z)$, and the two areas that are equal.


Case Study 4.3.2 


For his many notable achievements, Sir Francis Galton (1822-1911) is much 
admired by scientists and statisticians, but not so much by criminals (at least not 
criminals who know something about history). What should rankle the incarcer- 
ated set is the fact that Galton did groundbreaking work in using fingerprints 
for identification purposes. Late in the nineteenth century, he showed that all 
fingerprints could be classified into three generic types—the whorl, the loop, 
and the arch (see Figure 4.3.6). A few years later, Sir Edward Richard Henry, 
who would eventually become Commissioner of Scotland Yard, refined Gal- 
ton’s system to include eight generic types. The Henry system, as it came to 
be known, was quickly adopted by law enforcement agencies worldwide and 
ultimately became the foundation for the first AFIS (Automated Fingerprint 
Identification System) databases introduced in the 1990s. 


Figure 4.3.6 


There are many characteristics besides the three proposed by Galton and 
the eight proposed by Henry that can be used to distinguish one fingerprint 
from another. Among the most objective of these is the ridge count. In the 
loop pattern, there is a point—the triradius—where the three opposing ridge 
systems come together. A straight line drawn from the triradius to the center 
of the loop will cross a certain number of ridges; in Figure 4.3.7, that number 
is eleven. Adding the numbers of ridge crossings for each finger yields a sum 
known as the ridge count. 

Consider the following scenario. Police are investigating the murder of a
pedestrian in a heavily populated urban area that is thought to have been a
gang-related, drive-by shooting, perhaps as part of an initiation ritual. No
eyewitnesses have come forth, but an unregistered gun was found nearby that the
ballistics lab has confirmed was the murder weapon. Lifted from the gun was a
partially smudged set of latent fingerprints. None of the features typically used
for identification purposes was recognizable except for the ridge count, which
appeared to be at least 270. The police have arrested a young man who lives in
the area, is known to belong to a local gang, has no verifiable alibi for the night
of the shooting, and has a ridge count of 275. His trial is about to begin.
Figure 4.3.7


Neither the state nor the defense has a strong case. Both sides have
no choice but to base their arguments on the statistical implications of the
defendant’s ridge count. And both sides have access to the same background
information: that ridge counts for males are normally distributed with a mean
($\mu$) of 146 and a standard deviation ($\sigma$) of 52.


The state’s case 

Clearly, the defendant has an unusually high ridge count. The strength of the
prosecutor’s case hinges on how unusual. According to the Z transformation
given on p. 252 (together with a continuity correction), the probability of a ridge
count, Y, being at least 275 is 0.0068:

\[
P(Y \ge 275) = P\left(\frac{Y - 146}{52} \ge \frac{274.5 - 146}{52}\right) = P(Z \ge 2.47) = 0.0068
\]


This is great news for the prosecutor: jurors will most likely interpret a 
probability that small as being very strong evidence against the defendant. 
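
The prosecutor's figure can be reproduced with a short Python sketch. The helper name `p_ridge_count_at_least` is hypothetical; the only inputs are the mean of 146 and standard deviation of 52 quoted in the case study. Because the code keeps the unrounded Z value instead of 2.47, its answer agrees with 0.0068 only to about three decimal places.

```python
from statistics import NormalDist

MU, SIGMA = 146.0, 52.0   # ridge-count mean and standard deviation from the case study

def p_ridge_count_at_least(count):
    """Normal approximation to P(Y >= count) for an integer-valued ridge count Y.

    The continuity correction replaces the event Y >= count by Y >= count - 0.5.
    """
    z = (count - 0.5 - MU) / SIGMA
    return 1.0 - NormalDist().cdf(z)

p_state = p_ridge_count_at_least(275)
print(f"P(Y >= 275) is approximately {p_state:.4f}")
```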




The defense’s case 


The defense must necessarily try to establish “reasonable doubt” by showing
that the probability is fairly high that someone other than the defendant could
have committed the murder. To make that argument requires an application of
conditional probability as it pertains to the binomial distribution.

Suppose n male gang members were riding around on the night of the
shooting and could conceivably have committed the crime, and let X denote
the number of those individuals who have ridge counts of at least 270. Then

\[
P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}
\]

where

\[
p = P(Y \ge 270) = P\left(\frac{Y - 146}{52} \ge \frac{269.5 - 146}{52}\right) = P(Z \ge 2.38) = 0.0087
\]

Also,

\[
P(X = 1) = \binom{n}{1} p^1 (1 - p)^{n - 1} = np(1 - p)^{n - 1}
\]

and

\[
P(X \ge 1) = 1 - P(X = 0) = 1 - (1 - p)^n
\]

Therefore,

\[
P(X \ge 2 \mid X \ge 1) = \frac{P(X \ge 2)}{P(X \ge 1)} = \frac{1 - (1 - p)^n - np(1 - p)^{n - 1}}{1 - (1 - p)^n}
\]

= P(at least two persons have ridge counts $\ge$ 270 | at least one person has a
ridge count $\ge$ 270)
= P(at least one other person besides the defendant could have committed
the murder)
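
The conditional probability just derived is simple to tabulate. The Python sketch below evaluates it for a few group sizes n, taking p = 0.0087 from the calculation above; the function name `p_someone_else` is an illustrative choice.

```python
def p_someone_else(n, p=0.0087):
    """P(X >= 2 | X >= 1) when X is binomial with parameters n and p."""
    p_none = (1.0 - p) ** n                      # P(X = 0)
    p_one = n * p * (1.0 - p) ** (n - 1)         # P(X = 1)
    p_at_least_one = 1.0 - p_none                # P(X >= 1)
    p_at_least_two = 1.0 - p_none - p_one        # P(X >= 2)
    return p_at_least_two / p_at_least_one

for n in (25, 50, 100, 150):
    print(f"n = {n:3d}: P(X >= 2 | X >= 1) = {p_someone_else(n):.2f}")
```

The probability climbs steadily with n, which is the heart of the defense's argument: the more gang members who could have been present, the less the ridge count singles out the defendant.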


How large was n on the night of the shooting? There is no way to know, but
it could have been sizeable, given the amount of gang activity found in many
metropolitan areas. Table 4.3.3 lists the values of $P(X \ge 2 \mid X \ge 1)$ calculated for
various values of n ranging from 25 to 200. For example, if n = 50 gang members
including the defendant were riding around on the night of the murder, there
is a 20% chance that at least one other individual besides the defendant has a
ridge count of at least 270 (and might be the shooter).


Table 4.3.3

  n    P(X >= 2 | X >= 1)

 25    0.10
 50    0.20
100    0.37
150    0.51


Imagine yourself on the jury. Which is the more persuasive statistical analysis,
the state’s calculation that $P(Y \ge 275) = 0.0068$, or the defense’s tabulation
of $P(X \ge 2 \mid X \ge 1)$? Would your verdict be “guilty” or “not guilty”?


About the Data Given that astrologers, psychics, and Tarot card readers still
abound, it should come as no surprise that fingerprint patterns gave rise to their
own form of fortune-telling (known more elegantly as dactylomancy). According to
those who believe in such things, a person “having whorls on all fingers is restless,
vacillating, doubting, sensitive, clever, eager for action, and inclined to crime.”
Needless to say, of course, “A mixture of loops and whorls signifies a neutral character, a
person who is kind, obedient, truthful, but often undecided and impatient” (32).


Example 4.3.6


Mensa (from the Latin word for “mind”) is an international society devoted to
intellectual pursuits. Any person who has an IQ in the upper 2% of the general
population is eligible to join. What is the lowest IQ that will qualify a person for
membership? Assume that IQs are normally distributed with $\mu = 100$ and $\sigma = 16$.

Let the random variable Y denote a person’s IQ, and let the constant $y_L$ be the
lowest IQ that qualifies someone to be a card-carrying Mensan. The two are related
by a probability equation:

\[
P(Y \ge y_L) = 0.02
\]

or, equivalently,

\[
P(Y < y_L) = 1 - 0.02 = 0.98 \tag{4.3.4}
\]

(see Figure 4.3.8).
Applying the Z transformation to Equation 4.3.4 gives

\[
P(Y < y_L) = P\left(\frac{Y - 100}{16} < \frac{y_L - 100}{16}\right) = P\left(Z < \frac{y_L - 100}{16}\right) = 0.98
\]

Figure 4.3.8

[Normal curve $f_Y(y)$ for IQ, centered at 100: the area to the left of $y_L$ is 0.98; the
area to the right of $y_L$ (qualifies for membership) is 0.02.]


From the standard normal table in Appendix Table A.1, though,

\[
P(Z < 2.05) = 0.9798 \doteq 0.98
\]

Since $\frac{y_L - 100}{16}$ and 2.05 are both cutting off the same area of 0.02 under $f_Z(z)$, they
must be equal, which implies that 133 is the lowest acceptable IQ for Mensa:

\[
y_L = 100 + 16(2.05) = 132.8 \approx 133
\]


Example 4.3.7

Suppose a random variable Y has the moment-generating function $M_Y(t) = e^{3t + 8t^2}$.
Calculate $P(-1 \le Y \le 9)$.
To begin, notice that $M_Y(t)$ has the same form as the moment-generating
function for a normal random variable. That is,

\[
e^{3t + 8t^2} = e^{\mu t + (\sigma^2 t^2)/2}
\]

where $\mu = 3$ and $\sigma^2 = 16$ (recall Example 3.12.4). To evaluate $P(-1 \le Y \le 9)$, then,
requires an application of the Z transformation:

\[
P(-1 \le Y \le 9) = P\left(\frac{-1 - 3}{4} \le \frac{Y - 3}{4} \le \frac{9 - 3}{4}\right) = P(-1.00 \le Z \le 1.50)
= 0.9332 - 0.1587 = 0.7745
\]


Theorem 4.3.3

Let $Y_1$ be a normally distributed random variable with mean $\mu_1$ and variance $\sigma_1^2$,
and let $Y_2$ be a normally distributed random variable with mean $\mu_2$ and variance $\sigma_2^2$.
Define $Y = Y_1 + Y_2$. If $Y_1$ and $Y_2$ are independent, Y is normally distributed with mean
$\mu_1 + \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$.

Proof Let $M_{Y_i}(t)$ denote the moment-generating function for $Y_i$, i = 1, 2, and let
$M_Y(t)$ be the moment-generating function for Y. Since $Y = Y_1 + Y_2$, and the $Y_i$’s are
independent,

\[
M_Y(t) = M_{Y_1}(t) \cdot M_{Y_2}(t)
= e^{\mu_1 t + (\sigma_1^2 t^2)/2} \cdot e^{\mu_2 t + (\sigma_2^2 t^2)/2} \quad \text{(See Example 3.12.4)}
= e^{(\mu_1 + \mu_2)t + (\sigma_1^2 + \sigma_2^2)t^2/2}
\]

We recognize the latter, though, to be the moment-generating function for a normal
random variable with mean $\mu_1 + \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$. The result follows by
virtue of the uniqueness property stated in Theorem 3.12.2.


Corollary 4.3.1

Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size n from a normal distribution with mean
$\mu$ and variance $\sigma^2$. Then the sample mean, $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$, is also normally distributed
with mean $\mu$ but with variance equal to $\sigma^2/n$ (which implies that $\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$ is a standard
normal random variable, Z).
Corollary 4.3.2

Let $Y_1, Y_2, \ldots, Y_n$ be any set of independent normal random variables with means
$\mu_1, \mu_2, \ldots, \mu_n$ and variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$, respectively. Let $a_1, a_2, \ldots, a_n$ be any
set of constants. Then $Y = a_1Y_1 + a_2Y_2 + \cdots + a_nY_n$ is normally distributed with mean
$\mu = \sum_{i=1}^{n} a_i\mu_i$ and variance $\sigma^2 = \sum_{i=1}^{n} a_i^2\sigma_i^2$.

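
The bookkeeping in Corollary 4.3.2 can be sketched in Python as follows. The function `linear_combination` is a hypothetical helper, and the numbers in the usage lines are made up for illustration: they take four independent N(10, 4) variables with weights of 1/4 each, which is the sample mean of Corollary 4.3.1.

```python
import math
from statistics import NormalDist

def linear_combination(coeffs, means, sds):
    """Distribution of a_1*Y_1 + ... + a_n*Y_n for independent normal Y_i.

    coeffs, means, and sds hold the a_i, the means mu_i, and the standard
    deviations sigma_i (not the variances), in matching order.
    """
    mean = sum(a * m for a, m in zip(coeffs, means))
    variance = sum((a * s) ** 2 for a, s in zip(coeffs, sds))
    return NormalDist(mean, math.sqrt(variance))

# Illustrative values: the average of n = 4 observations, each N(10, 2^2).
n = 4
y_bar = linear_combination([1.0 / n] * n, [10.0] * n, [2.0] * n)
print(f"mean = {y_bar.mean}, standard deviation = {y_bar.stdev}")
```

The standard deviation of the average comes out as $2/\sqrt{4} = 1$, in agreement with the $\sigma^2/n$ variance given in Corollary 4.3.1.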
Example 4.3.8

The elevator in the athletic dorm at Swampwater Tech has a maximum capacity of
twenty-four hundred pounds. Suppose that ten football players get on at the
twentieth floor. If the weights of Tech’s players are normally distributed with a mean of
two hundred twenty pounds and a standard deviation of twenty pounds, what is the
probability that there will be ten fewer Muskrats at tomorrow’s practice?

Let the random variables $Y_1, Y_2, \ldots, Y_{10}$ denote the weights of the ten players.
At issue is the probability that $Y = \sum_{i=1}^{10} Y_i$ exceeds twenty-four hundred pounds. But

\[
P\left(\sum_{i=1}^{10} Y_i > 2400\right) = P\left(\frac{1}{10}\sum_{i=1}^{10} Y_i > \frac{1}{10} \cdot 2400\right) = P(\bar{Y} > 240.0)
\]

A Z transformation can be applied to the latter expression using the corollary on
p. 257:

\[
P(\bar{Y} > 240.0) = P\left(\frac{\bar{Y} - 220}{20/\sqrt{10}} > \frac{240.0 - 220}{20/\sqrt{10}}\right) = P(Z > 3.16) = 0.0008
\]

Clearly, the chances of a Muskrat splat are minimal. (How much would the
probability change if eleven players squeezed onto the elevator?)
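
A short Python sketch makes it easy to experiment with the elevator calculation. It works with the total weight directly, which by Theorem 4.3.3 is normal with mean $n\mu$ and standard deviation $\sigma\sqrt{n}$; this is equivalent to the sample-mean formulation used above. The function name `p_overload` and its default arguments are illustrative.

```python
import math
from statistics import NormalDist

def p_overload(n_players, capacity=2400.0, mu=220.0, sigma=20.0):
    """P(total weight of n_players independent N(mu, sigma^2) weights > capacity)."""
    total = NormalDist(n_players * mu, sigma * math.sqrt(n_players))
    return 1.0 - total.cdf(capacity)

p_ten = p_overload(10)
print(f"ten players: P(overload) is approximately {p_ten:.4f}")

# Changing the argument explores the closing question about an eleventh player.
p_eleven = p_overload(11)
print(f"eleven players: P(overload) is approximately {p_eleven:.4f}")
```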


Questions 


4.3.21. Econo-Tire is planning an advertising campaign 
for its newest product, an inexpensive radial. Preliminary 
road tests conducted by the firm’s quality-control depart- 
ment have suggested that the lifetimes of these tires will 
be normally distributed with an average of thirty thousand 
miles and a standard deviation of five thousand miles. The 
marketing division would like to run a commercial that 
makes the claim that at least nine out of ten drivers will 
get at least twenty-five thousand miles on a set of Econo- 
Tires. Based on the road test data, is the company justified 
in making that assertion? 


4.3.22. A large computer chip manufacturing plant under
construction in Westbank is expected to result in an
additional fourteen hundred children in the county’s public
school system once the permanent workforce arrives.
Any child with an IQ under 80 or over 135 will require
individualized instruction that will cost the city an
additional $1750 per year. How much money should Westbank
anticipate spending next year to meet the needs of its
new special ed students? Assume that IQ scores are
normally distributed with a mean ($\mu$) of 100 and a standard
deviation ($\sigma$) of 16.


4.3.23. Records for the past several years show that the
amount of money collected daily by a prominent
televangelist is normally distributed with a mean ($\mu$) of $20,000
and a standard deviation ($\sigma$) of $5000. What are the
chances that tomorrow’s donations will exceed $30,000?


4.3.24. The following letter was written to a well-known 
dispenser of advice to the lovelorn (171): 


Dear Abby: You wrote in your column that a 
woman is pregnant for 266 days. Who said so? I 
carried my baby for ten months and five days, and 
there is no doubt about it because I know the exact 
date my baby was conceived. My husband is in the 
Navy and it couldn’t have possibly been conceived 
any other time because I saw him only once for an 
hour, and I didn’t see him again until the day before 
the baby was born. 

I don’t drink or run around, and there is no way 
this baby isn’t his, so please print a retraction about 
the 266-day carrying time because otherwise I am in 
a lot of trouble. 

San Diego Reader 


Whether or not San Diego Reader is telling the truth is
a judgment that lies beyond the scope of any statistical
analysis, but quantifying the plausibility of her story does
not. According to the collective experience of generations
of pediatricians, pregnancy durations, Y, tend to be
normally distributed with $\mu = 266$ days and $\sigma = 16$ days. Do a
probability calculation that addresses San Diego Reader’s
credibility. What would you conclude?


4.3.25. A criminologist has developed a questionnaire for
predicting whether a teenager will become a delinquent.
Scores on the questionnaire can range from 0 to 100,
with higher values reflecting a presumably greater
criminal tendency. As a rule of thumb, the criminologist decides
to classify a teenager as a potential delinquent if his or
her score exceeds 75. The questionnaire has already been
tested on a large sample of teenagers, both delinquent and
nondelinquent. Among those considered nondelinquent,
scores were normally distributed with a mean ($\mu$) of 60
and a standard deviation ($\sigma$) of 10. Among those
considered delinquent, scores were normally distributed with a
mean of 80 and a standard deviation of 5.

(a) What proportion of the time will the criminologist
misclassify a nondelinquent as a delinquent?
A delinquent as a nondelinquent?

(b) On the same set of axes, draw the normal curves that
represent the distributions of scores made by
delinquents and nondelinquents. Shade the two areas
that correspond to the probabilities asked for in
part (a).


4.3.26. The cross-sectional area of plastic tubing for use
in pulmonary resuscitators is normally distributed with
$\mu = 12.5$ mm$^2$ and $\sigma = 0.2$ mm$^2$. When the area is less than
12.0 mm$^2$ or greater than 13.0 mm$^2$, the tube does not fit
properly. If the tubes are shipped in boxes of one
thousand, how many wrong-sized tubes per box can doctors
expect to find?


4.3.27. At State University, the average score of the enter- 
ing class on the verbal portion of the SAT is 565, with a 
standard deviation of 75. Marian scored a 660. How many 
of State’s other 4250 freshmen did better? Assume that 
the scores are normally distributed. 


4.3.28. A college professor teaches Chemistry 101 each 
fall to a large class of freshmen. For tests, she uses stan- 
dardized exams that she knows from past experience pro- 
duce bell-shaped grade distributions with a mean of 70 and 
a standard deviation of 12. Her philosophy of grading is to 
impose standards that will yield, in the long run, 20% A’s, 
26% B’s, 38% C’s, 12% D’s, and 4% F’s. Where should 
the cutoff be between the A’s and the B’s? Between the 
B’s and the C’s? 
4.3.29. Suppose the random variable Y can be described
by a normal curve with $\mu = 40$. For what value of $\sigma$ is

\[
P(20 \le Y \le 60) = 0.50
\]


4.3.30. It is estimated that 80% of all eighteen-year-old
women have weights ranging from 103.5 to 144.5 lb.
Assuming the weight distribution can be adequately
modeled by a normal curve and that 103.5 and 144.5 are
equidistant from the average weight $\mu$, calculate $\sigma$.


4.3.31. Recall the breath analyzer problem described in 
Example 4.3.5. Suppose the driver’s blood alcohol concen- 
tration is actually 0.09% rather than 0.075%. What is the 
probability that the breath analyzer will make an error in 
his favor and indicate that he is not legally drunk? Sup- 
pose the police offer the driver a choice—either take the 
sobriety test once or take it twice and average the read- 
ings. Which option should a “0.075%” driver take? Which 
option should a “0.09%” driver take? Explain. 


4.3.32. If a random variable Y is normally distributed
with mean $\mu$ and standard deviation $\sigma$, the Z ratio $\frac{Y - \mu}{\sigma}$
is often referred to as a normed score: It indicates the
magnitude of y relative to the distribution from which it
came. “Norming” is sometimes used as an affirmative-action
mechanism in hiring decisions. Suppose a cosmetics
company is seeking a new sales manager. The aptitude test
they have traditionally given for that position shows a
distinct gender bias: Scores for men are normally distributed
with $\mu = 62.0$ and $\sigma = 7.6$, while scores for women are
normally distributed with $\mu = 76.3$ and $\sigma = 10.8$. Laura
and Michael are the two candidates vying for the
position: Laura has scored 92 on the test and Michael 75. If
the company agrees to norm the scores for gender bias,
whom should they hire?


4.3.33. The IQs of nine randomly selected people are
recorded. Let $\bar{Y}$ denote their average. Assuming the
distribution from which the $Y_i$’s were drawn is normal
with a mean of 100 and a standard deviation of 16,
what is the probability that $\bar{Y}$ will exceed 103? What
is the probability that any arbitrary $Y_i$ will exceed 103?
What is the probability that exactly three of the $Y_i$’s will
exceed 103?


4.3.34. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a
normal distribution where the mean is 2 and the variance is 4.
How large must n be in order that

\[
P(1.9 \le \bar{Y} \le 2.1) \ge 0.99
\]


4.3.35. A circuit contains three resistors wired in series.
Each is rated at 6 ohms. Suppose, however, that the true
resistance of each one is a normally distributed random
variable with a mean of 6 ohms and a standard
deviation of 0.3 ohm. What is the probability that the combined
resistance will exceed 19 ohms? How “precise” would the
manufacturing process have to be to make the probability
less than 0.005 that the combined resistance of the circuit
would exceed 19 ohms?


4.3.36. The cylinders and pistons for a certain internal
combustion engine are manufactured by a process that
gives a normal distribution of cylinder diameters with a
mean of 41.5 cm and a standard deviation of 0.4 cm.
Similarly, the distribution of piston diameters is normal with
a mean of 40.5 cm and a standard deviation of 0.3 cm.
If the piston diameter is greater than the cylinder
diameter, the former can be reworked until the two “fit.”
What proportion of cylinder-piston pairs will need to be
reworked?


4.3.37. Use moment-generating functions to prove the 
two corollaries to Theorem 4.3.3. 


4.3.38. Let $Y_1, Y_2, \ldots, Y_9$ be a random sample of size
9 from a normal distribution where $\mu = 2$ and $\sigma = 2$.
Let $Y_1^*, Y_2^*, \ldots, Y_9^*$ be an independent random sample
from a normal distribution having $\mu = 1$ and $\sigma = 1$. Find
$P(\bar{Y} \ge \bar{Y}^*)$.


4.4 The Geometric Distribution 


Consider a series of independent trials, each having one of two possible outcomes,
success or failure. Let p = P(Trial ends in success). Define the random variable X to
be the trial at which the first success occurs. Figure 4.4.1 suggests a formula for the
pdf of X:

\[
\begin{aligned}
p_X(k) = P(X = k) &= P(\text{First success occurs on } k\text{th trial}) \\
&= P(\text{First } k - 1 \text{ trials end in failure and } k\text{th trial ends in success}) \\
&= P(\text{First } k - 1 \text{ trials end in failure}) \cdot P(k\text{th trial ends in success}) \\
&= (1 - p)^{k-1}p, \quad k = 1, 2, \ldots
\end{aligned}
\tag{4.4.1}
\]

We call the probability model in Equation 4.4.1 a geometric distribution (with
parameter p).
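
Equation 4.4.1 translates directly into code. The sketch below defines the geometric pdf and checks numerically that its probabilities add up to essentially 1 over the first two hundred trials; the function name and the value p = 0.3 are illustrative assumptions, not taken from the text.

```python
def geometric_pmf(k, p):
    """P(X = k): the first success occurs on trial k, with success probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must satisfy 0 < p <= 1")
    if k < 1:
        return 0.0
    return (1.0 - p) ** (k - 1) * p

p = 0.3   # illustrative success probability
for k in range(1, 6):
    print(f"P(X = {k}) = {geometric_pmf(k, p):.4f}")

# The probabilities over k = 1, 2, ..., 200 already account for almost all of the mass.
partial_sum = sum(geometric_pmf(k, p) for k in range(1, 201))
print(f"sum over k = 1..200: {partial_sum:.6f}")
```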


Figure 4.4.1

[Diagram of independent trials 1, 2, ..., k - 1, k: the first k - 1 trials are failures (F)
and trial k is the first success (S).]


Comment Even without its association with independent trials and Figure 4.4.1,
the function $p_X(k) = (1 - p)^{k-1}p$, k = 1, 2, ... qualifies as a discrete pdf because (1)
$p_X(k) \ge 0$ for all k and (2) $\sum_{\text{all } k} p_X(k) = 1$:

\[
\sum_{k=1}^{\infty} (1 - p)^{k-1}p = p\sum_{j=0}^{\infty} (1 - p)^j = p \cdot \frac{1}{1 - (1 - p)} = 1
\]


Example 4.4.1

A pair of fair dice are tossed until a sum of 7 appears for the first time. What is the
probability that more than four rolls will be required for that to happen?

Each throw of the dice here is an independent trial for which

\[
p = P(\text{sum} = 7) = \frac{6}{36} = \frac{1}{6}
\]

Let X denote the roll at which the first sum of 7 appears. Clearly, X has the structure
of a geometric random variable, and

\[
P(X > 4) = 1 - P(X \le 4) = 1 - \sum_{k=1}^{4} \left(\frac{5}{6}\right)^{k-1}\left(\frac{1}{6}\right) = 1 - \frac{671}{1296} = 0.48
\]
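
The dice calculation can be verified exactly with rational arithmetic. The Python sketch below enumerates the thirty-six equally likely outcomes to obtain p and then evaluates the geometric sum; all variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(sum(a + b == 7 for a, b in outcomes), len(outcomes))   # P(sum = 7)

# P(X <= 4): the first sum of 7 appears on roll 1, 2, 3, or 4.
p_at_most_4 = sum((1 - p) ** (k - 1) * p for k in range(1, 5))
p_more_than_4 = 1 - p_at_most_4

print(f"p = {p}, P(X <= 4) = {p_at_most_4}, P(X > 4) = {p_more_than_4}")
print(f"P(X > 4) as a decimal: {float(p_more_than_4):.2f}")
```

The same answer follows from a shortcut: more than four rolls are needed exactly when the first four rolls all fail, an event with probability $(5/6)^4$.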


Theorem 4.4.1

Let X have a geometric distribution with $p_X(k) = (1 - p)^{k-1}p$, k = 1, 2, .... Then

1. $M_X(t) = \dfrac{pe^t}{1 - (1 - p)e^t}$

2. $E(X) = \dfrac{1}{p}$

3. $\mathrm{Var}(X) = \dfrac{1 - p}{p^2}$


Proof See Examples 3.12.1 and 3.12.5 for derivations of $M_X(t)$ and $E(X)$. The
formula for Var(X) is left as an exercise.
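
A numerical check does not replace the derivations cited in the proof, but it can build confidence in the formulas $E(X) = 1/p$ and $\mathrm{Var}(X) = (1 - p)/p^2$. The following sketch sums the first two thousand terms of the defining series; the function name, the truncation point, and the value p = 0.25 are illustrative assumptions.

```python
def geometric_moments(p, terms=2000):
    """Approximate E(X) and Var(X) for a geometric X by truncating the series."""
    mean = 0.0
    second_moment = 0.0
    for k in range(1, terms + 1):
        prob = (1.0 - p) ** (k - 1) * p      # p_X(k)
        mean += k * prob
        second_moment += k * k * prob
    return mean, second_moment - mean ** 2   # Var(X) = E(X^2) - [E(X)]^2

p = 0.25   # illustrative success probability
mean, variance = geometric_moments(p)
print(f"series: mean = {mean:.6f}, variance = {variance:.6f}")
print(f"formulas: 1/p = {1 / p:.6f}, (1 - p)/p^2 = {(1 - p) / p ** 2:.6f}")
```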


Example 4.4.2

A grocery store is sponsoring a sales promotion where the cashiers give away one of the letters A, E, L, S, U, or V for each purchase. If a customer collects all six (spelling VALUES), he or she gets $10 worth of groceries free. What is the expected number of trips to the store a customer needs to make in order to get a complete set? Assume the different letters are given away randomly.

Let \(X_i\) denote the number of purchases necessary to get the \(i\)th different letter, \(i = 1, 2, \ldots, 6\), and let \(X\) denote the number of purchases necessary to qualify for the $10. Then \(X = X_1 + X_2 + \cdots + X_6\) (see Figure 4.4.2). Clearly, \(X_1\) equals 1 with probability 1, so \(E(X_1) = 1\). Having received the first letter, the chances of getting a different one are \(\frac{5}{6}\) for each subsequent trip to the store. Therefore,

\[
f_{X_2}(k) = P(X_2 = k) = \left(\frac{1}{6}\right)^{k-1}\frac{5}{6}, \quad k = 1, 2, \ldots
\]

Figure 4.4.2: Trips to the store partitioned into \(X_1\) (first letter), \(X_2\) (second different letter), \(X_3\) (third different letter), ..., \(X_6\) (sixth different letter), with \(X = X_1 + X_2 + \cdots + X_6\).

That is, \(X_2\) is a geometric random variable with parameter \(p = \frac{5}{6}\). By Theorem 4.4.1, \(E(X_2) = \frac{6}{5}\). Similarly, the chances of getting a third different letter are \(\frac{4}{6}\) (for each purchase), so

\[
f_{X_3}(k) = P(X_3 = k) = \left(\frac{2}{6}\right)^{k-1}\left(\frac{4}{6}\right), \quad k = 1, 2, \ldots
\]

and \(E(X_3) = \frac{6}{4}\). Continuing in this fashion, we can find the remaining \(E(X_i)\)’s. It follows that a customer will have to make 14.7 trips to the store, on the average, to collect a complete set of six letters:

\[
E(X) = \sum_{i=1}^{6} E(X_i) = 1 + \frac{6}{5} + \frac{6}{4} + \frac{6}{3} + \frac{6}{2} + \frac{6}{1} = 14.7
\]
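The sum in Example 4.4.2 follows one pattern: once \(i\) distinct letters are in hand, the wait for a new one is geometric with \(p = (6 - i)/6\). The short Python sketch below (our own illustration; the function name `expected_trips` is hypothetical) adds the six expected values exactly.

```python
from fractions import Fraction

def expected_trips(n_letters):
    """Expected purchases needed to collect all n_letters equally likely letters.

    After i distinct letters are in hand, the wait for a new one is geometric
    with p = (n_letters - i) / n_letters, so its mean is 1/p (Theorem 4.4.1).
    """
    total = Fraction(0)
    for i in range(n_letters):
        p_new = Fraction(n_letters - i, n_letters)
        total += 1 / p_new
    return total

e_x = expected_trips(6)
print(e_x, float(e_x))
```

Changing the argument shows how quickly the expected number of trips grows as the number of letters in the promotion increases.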
Questions 


4.4.1. Because of her past convictions for mail fraud and 
forgery, Jody has a 30% chance each year of having her 
tax returns audited. What is the probability that she will 
escape detection for at least three years? Assume that she 
exaggerates, distorts, misrepresents, lies, and cheats every 
year. 


4.4.2. A teenager is trying to get a driver’s license. Write out the formula for the pdf \(p_X(k)\), where the random variable \(X\) is the number of tries that he needs to pass the road test. Assume that his probability of passing the exam on any given attempt is 0.10. On the average, how many attempts is he likely to require before he gets his license?


4.4.3. Is the following set of data likely to have come 
from the geometric pdf py(k) = ee -(4), k= 1,2,...? 
Explain. 


28 122 5 12 8 3 
5 42472 2 8 4 7 
262 3 5 13 3 2°55 
422 3 63 64 9 3 
3.7 5 13 4 3 4 6 2 


4.4.4. Recently married, a young couple plans to continue having children until they have their first girl. Suppose the probability that a child is a girl is \(\frac{1}{2}\), the outcome of each birth is an independent event, and the birth at which the first girl appears has a geometric distribution. What is the couple’s expected family size? Is the geometric pdf a reasonable model here? Discuss.


4.4.5. Show that the cdf for a geometric random variable is given by \(F_X(t) = P(X \leq t) = 1 - (1-p)^{[t]}\), where \([t]\) denotes the greatest integer in \(t\), \(t \geq 0\).


4.4.6. Suppose three fair dice are tossed repeatedly. Let the random variable \(X\) denote the roll on which a sum of 4 appears for the first time. Use the expression for \(F_X(t)\) given in Question 4.4.5 to evaluate \(P(65 \leq X \leq 75)\).


4.4.7. Let \(Y\) be an exponential random variable, where \(f_Y(y) = \lambda e^{-\lambda y}\), \(0 \leq y\). For any positive integer \(n\), show that \(P(n \leq Y \leq n+1) = e^{-\lambda n}(1 - e^{-\lambda})\). Note that if \(p = 1 - e^{-\lambda}\), the “discrete” version of the exponential pdf is the geometric pdf.


4.4.8. Sometimes the geometric random variable is defined to be the number of trials, \(X\), preceding the first success. Write down the corresponding pdf and derive the moment-generating function for \(X\) two ways: (1) by evaluating \(E(e^{tX})\) directly and (2) by using Theorem 3.12.3.


4.4.9. Differentiate the moment-generating function for 
a geometric random variable and verify the expressions 
given for E(X) and Var(X) in Theorem 4.4.1. 


4.4.10. Suppose that the random variables X, and X, have 


dg 
dM t= ae an 
5 an x5 ( ) 1-(1-})e’ 
tively. Let X = X, + X2. Does X have a geometric distri- 
bution? Assume that X, and X, are independent. 


lot 
mgfs My, (t) = mee 


Sie. respec- 
5 


4.4.11. The factorial moment-generating function for any random variable \(W\) is the expected value of \(t^W\). Moreover, \(\frac{d^r}{dt^r}E(t^W)\big|_{t=1} = E[W(W-1)\cdots(W-r+1)]\). Find the factorial moment-generating function for a geometric random variable and use it to verify the expected value and variance formulas given in Theorem 4.4.1.


4.5 The Negative Binomial Distribution 


The geometric distribution introduced in Section 4.4 can be generalized in a very straightforward fashion. Imagine waiting for the \(r\)th (instead of the first) success in a series of independent trials, where each trial has a probability of \(p\) of ending in success (see Figure 4.5.1).


Figure 4.5.1: Independent trials 1, 2, ..., \(k-1\) containing \(r-1\) successes (S) and \(k-1-(r-1)\) failures (F), followed by the \(r\)th success on trial \(k\).

Let the random variable \(X\) denote the trial at which the \(r\)th success occurs. Then

\[
\begin{aligned}
p_X(k) = P(X = k) &= P(\text{$r$th success occurs on $k$th trial}) \\
&= P(\text{$r-1$ successes occur in first $k-1$ trials and success occurs on $k$th trial}) \\
&= P(\text{$r-1$ successes occur in first $k-1$ trials}) \cdot P(\text{Success occurs on $k$th trial}) \\
&= \binom{k-1}{r-1}p^{r-1}(1-p)^{k-r} \cdot p \\
&= \binom{k-1}{r-1}p^{r}(1-p)^{k-r}, \quad k = r, r+1, \ldots
\end{aligned}
\tag{4.5.1}
\]

Any random variable whose pdf has the form given in Equation 4.5.1 is said to have a negative binomial distribution (with parameters \(r\) and \(p\)).

Comment Two equivalent formulations of the negative binomial structure are widely used. Sometimes \(X\) is defined to be the number of trials preceding the \(r\)th success; other times, \(X\) is taken to be the number of trials in excess of \(r\) that are necessary to achieve the \(r\)th success. The underlying probability structure is the same, however \(X\) is defined. We will primarily use Equation 4.5.1; properties of the other two definitions for \(X\) will be covered in the exercises.

Theorem 4.5.1 Let \(X\) have a negative binomial distribution with \(p_X(k) = \binom{k-1}{r-1}p^{r}(1-p)^{k-r}\), \(k = r, r+1, \ldots\). Then

1. \(M_X(t) = \left[\dfrac{pe^t}{1-(1-p)e^t}\right]^r\)
2. \(E(X) = \dfrac{r}{p}\)
3. \(\text{Var}(X) = \dfrac{r(1-p)}{p^2}\)


Proof All of these results follow immediately from the fact that \(X\) can be written as the sum of \(r\) independent geometric random variables, \(X_1, X_2, \ldots, X_r\), each with parameter \(p\). That is,

\[
\begin{aligned}
X &= \text{total number of trials to achieve $r$th success} \\
&= \text{number of trials to achieve 1st success} \\
&\quad + \text{number of additional trials to achieve 2nd success} + \cdots \\
&\quad + \text{number of additional trials to achieve $r$th success} \\
&= X_1 + X_2 + \cdots + X_r
\end{aligned}
\]

where

\[
p_{X_i}(k) = (1-p)^{k-1}p, \quad k = 1, 2, \ldots, \quad i = 1, 2, \ldots, r
\]

Therefore,

\[
M_X(t) = M_{X_1}(t)M_{X_2}(t)\cdots M_{X_r}(t) = \left[\frac{pe^t}{1-(1-p)e^t}\right]^r
\]

Also, from Theorem 4.4.1,

\[
E(X) = E(X_1) + E(X_2) + \cdots + E(X_r) = \frac{1}{p} + \frac{1}{p} + \cdots + \frac{1}{p} = \frac{r}{p}
\]

and

\[
\text{Var}(X) = \text{Var}(X_1) + \text{Var}(X_2) + \cdots + \text{Var}(X_r) = \frac{1-p}{p^2} + \frac{1-p}{p^2} + \cdots + \frac{1-p}{p^2} = \frac{r(1-p)}{p^2}
\]

Example 4.5.1


The California Mellows are a semipro baseball team. Eschewing all forms of vio- 
lence, the laid-back Mellow batters never swing at a pitch, and should they be 
fortunate enough to reach base on a walk, they never try to steal. On the aver- 
age, how many runs will the Mellows score in a nine-inning road game, assuming 
the opposing pitcher has a 50% probability of throwing a strike on any given pitch 
(83)? 

The solution to this problem illustrates very nicely the interplay between the 
physical constraints imposed by a question (in this case, the rules of baseball) 
and the mathematical characteristics of the underlying probability model. The 
negative binomial distribution appears twice in this analysis, along with sev- 
eral of the properties associated with expected values and linear combina- 
tions. 

To begin, we calculate the probability of a Mellow batter striking out. Let the 
random variable X denote the number of pitches necessary for that to happen. 
Clearly, X = 3, 4, 5, or 6 (why can X not be larger than 6?), and 


\[
\begin{aligned}
p_X(k) = P(X = k) &= P(\text{2 strikes are called in the first $k-1$ pitches and the $k$th pitch is the 3rd strike}) \\
&= \binom{k-1}{2}\left(\frac{1}{2}\right)^{3}\left(\frac{1}{2}\right)^{k-3}, \quad k = 3, 4, 5, 6
\end{aligned}
\]

Therefore,

\[
P(\text{Batter strikes out}) = \sum_{k=3}^{6} p_X(k) = \left(\frac{1}{2}\right)^{3} + \binom{3}{2}\left(\frac{1}{2}\right)^{4} + \binom{4}{2}\left(\frac{1}{2}\right)^{5} + \binom{5}{2}\left(\frac{1}{2}\right)^{6} = \frac{21}{32}
\]

Now, let the random variable \(W\) denote the number of walks the Mellows get in a given inning. In order for \(W\) to take on the value \(w\), exactly two of the first \(w+2\) batters must strike out, as must the \((w+3)\)rd (see Figure 4.5.2). The pdf for \(W\), then, is a negative binomial with \(p = P(\text{Batter strikes out}) = \frac{21}{32}\):

\[
p_W(w) = P(W = w) = \binom{w+2}{w}\left(\frac{21}{32}\right)^{3}\left(\frac{11}{32}\right)^{w}, \quad w = 0, 1, 2, \ldots
\]

Figure 4.5.2: Batters 1 through \(w+2\) account for 2 outs and \(w\) walks; batter \(w+3\) makes the third out.


In order for a run to score, the pitcher must walk a Mellows batter with the 
bases loaded. Let the random variable R denote the total number of runs walked in 
during a given inning. Then 


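\[
R = \begin{cases} 0 & \text{if } w \leq 3 \\ w - 3 & \text{if } w > 3 \end{cases}
\]

and

\[
\begin{aligned}
E(R) &= \sum_{w=4}^{\infty} (w-3)\binom{w+2}{w}\left(\frac{21}{32}\right)^{3}\left(\frac{11}{32}\right)^{w} \\
&= \sum_{w=0}^{\infty} (w-3) \cdot P(W = w) - \sum_{w=0}^{3} (w-3) \cdot P(W = w) \\
&= E(W) - 3 + \sum_{w=0}^{3} (3-w) \cdot \binom{w+2}{w}\left(\frac{21}{32}\right)^{3}\left(\frac{11}{32}\right)^{w}
\end{aligned}
\tag{4.5.2}
\]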


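To evaluate \(E(W)\) using the statement of Theorem 4.5.1 requires a linear transformation to rescale \(W\) to the format of Equation 4.5.1. Let

\[
T = W + 3 = \text{total number of Mellow batters appearing in a given inning}
\]

Then

\[
p_T(t) = p_W(t-3) = \binom{t-1}{2}\left(\frac{21}{32}\right)^{3}\left(\frac{11}{32}\right)^{t-3}, \quad t = 3, 4, \ldots
\]

which we recognize as a negative binomial pdf with \(r = 3\) and \(p = \frac{21}{32}\). Therefore,

\[
E(T) = \frac{3}{21/32} = \frac{32}{7}
\]

which makes \(E(W) = E(T) - 3 = \frac{32}{7} - 3 = \frac{11}{7}\).

From Equation 4.5.2, then, the expected number of runs scored by the Mellows in a given inning is 0.202:

\[
E(R) = \frac{11}{7} - 3 + 3 \cdot \binom{2}{0}\left(\frac{21}{32}\right)^{3} + 2 \cdot \binom{3}{1}\left(\frac{21}{32}\right)^{3}\left(\frac{11}{32}\right) + 1 \cdot \binom{4}{2}\left(\frac{21}{32}\right)^{3}\left(\frac{11}{32}\right)^{2} = 0.202
\]

Each of the nine innings, of course, would have the same value for \(E(R)\), so the expected number of runs in a game is the sum \(0.202 + 0.202 + \cdots + 0.202 = 9(0.202)\), or 1.82.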
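The chain of calculations in Example 4.5.1 is easy to mis-key by hand, so a small Python sketch may be a useful cross-check. It is our own illustration rather than anything in the original text; the names `neg_binomial_pmf` and `p_walks` are hypothetical, and exact fractions are used until the final conversion to decimals.

```python
from fractions import Fraction
from math import comb

def neg_binomial_pmf(k, r, p):
    """P(rth success occurs on trial k), k = r, r + 1, ... (Equation 4.5.1)."""
    return comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)

# Strikeout: the 3rd strike arrives on pitch k = 3, 4, 5, or 6
p_strike = Fraction(1, 2)
p_out = sum(neg_binomial_pmf(k, 3, p_strike) for k in range(3, 7))

def p_walks(w):
    """P(W = w): w walks before the third out, i.e. T = w + 3 batters."""
    return neg_binomial_pmf(w + 3, 3, p_out)

# E(W) = E(T) - 3, with E(T) = r/p from Theorem 4.5.1
e_w = 3 / p_out - 3

# Equation 4.5.2: E(R) = E(W) - 3 + sum over w = 0..3 of (3 - w) P(W = w)
e_r = e_w - 3 + sum((3 - w) * p_walks(w) for w in range(4))
runs_per_game = 9 * e_r

print(p_out, e_w, float(e_r), float(runs_per_game))
```

Carried to more decimal places the per-inning value is a little above 0.202, which is why the nine-inning total rounds to 1.82.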


Case Study 4.5.1 


Natural phenomena that are particularly complicated for whatever reasons may 
be impossible to describe with any single, easy-to-work-with probability model. 
An effective Plan B in those situations is to break the phenomenon down into 
simpler components and simulate the contributions of each of those compo- 
nents by using randomly generated observations. These are called Monte Carlo 
analyses, an example of which is described in detail in Section 4.7. 

The fundamental requirement of any simulation technique is the ability to 
generate random observations from specified pdfs. In practice, this is done using 
computers because the number of observations needed is huge. In principle, 
though, the same, simple procedure can be used, by hand, to generate random 
observations from any discrete pdf. 

Recall Example 4.5.1 and the random variable \(W\), where \(W\) is the number of walks the Mellow batters are issued in a given inning. It was shown that \(p_W(w)\) is the particular negative binomial pdf,

\[
p_W(w) = P(W = w) = \binom{w+2}{w}\left(\frac{21}{32}\right)^{3}\left(\frac{11}{32}\right)^{w}, \quad w = 0, 1, 2, \ldots
\]

Suppose a record is kept of the numbers of walks the Mellow batters receive in each of the next one hundred innings the team plays. What might that record look like?

The answer is, the record will look like a random sample of size 100 drawn from \(p_W(w)\). Table 4.5.1 illustrates a procedure for generating such a sample. The first two columns show \(p_W(w)\) for the nine values of \(w\) likely to occur (0 through 8). The third column parcels out the one hundred digits 00 through 99 into nine intervals whose lengths correspond to the values of \(p_W(w)\).

There are twenty-nine two-digit numbers, for example, in the interval 28 to 56, with each of those numbers having the same probability of 0.01. Any random two-digit number that falls anywhere in that interval will then be mapped into the value \(w = 1\) (which will happen, in the long run, 29% of the time).

Tables of random digits are typically presented in blocks of twenty-five (see Figure 4.5.3).

Table 4.5.1

w      p_W(w)    Random Number Range
0      0.28      00-27
1      0.29      28-56
2      0.20      57-76
3      0.11      77-87
4      0.06      88-93
5      0.03      94-96
6      0.01      97
7      0.01      98
8+     0.01      99

Figure 4.5.3 (table of random digits):

23107 15053 39098
65402 70659 84864
75528 18738 05624
85830 56869 15227
13300 08158 48968
75604 02011
01188 85393
71585 97265
23495 61680
51851 16656

For the particular block circled, the first two columns,

22 17 83 57 27

would correspond to the negative binomial values

0 0 3 2 0

Figure 4.5.4 shows the results of using a table of random digits and Table 4.5.1 to generate a sample of one hundred random observations from \(p_W(w)\). The agreement is not perfect (as it shouldn’t be), but certainly very good (as it should be).

Figure 4.5.4: Probability (vertical axis, 0 to 0.30) plotted against the number of walks, \(W\) (horizontal axis, 0 through 8), for \(p_W(w) = \binom{w+2}{w}\left(\frac{21}{32}\right)^{3}\left(\frac{11}{32}\right)^{w}\).
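The lookup procedure behind Table 4.5.1 amounts to finding the interval that contains a given two-digit number. The Python sketch below is one way to express it (an illustration of ours, with hypothetical names `RANGE_UPPER_BOUNDS` and `walks_from_two_digits`); it reproduces the five values read from the circled block.

```python
# Table 4.5.1: upper end of each two-digit range and the value of w it maps to.
# The last entry (99) stands for the pooled category "8 or more".
RANGE_UPPER_BOUNDS = [(27, 0), (56, 1), (76, 2), (87, 3), (93, 4),
                      (96, 5), (97, 6), (98, 7), (99, 8)]

def walks_from_two_digits(d):
    """Map a random two-digit number 00-99 to a simulated number of walks."""
    if not 0 <= d <= 99:
        raise ValueError("expected a two-digit number between 00 and 99")
    for upper, w in RANGE_UPPER_BOUNDS:
        if d <= upper:
            return w

sample = [walks_from_two_digits(d) for d in (22, 17, 83, 57, 27)]
print(sample)
```

Feeding the function one hundred two-digit numbers from a random digit table would produce a sample of the kind summarized in Figure 4.5.4.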


About the Data Random number generators for continuous pdfs use random dig- 
its in ways that are much different from the strategy illustrated in Table 4.5.1 and 
much different from each other. The standard normal pdf and the exponential pdf 
are two cases in point. 

Let \(U_1, U_2, \ldots\) be a set of random observations drawn from the uniform pdf defined over the interval [0, 1]. Standard normal observations are generated by appealing to the central limit theorem. Since each \(U_i\) has \(E(U_i) = 1/2\) and \(\text{Var}(U_i) = 1/12\), it follows that

\[
E\left(\sum_{i=1}^{k} U_i\right) = k/2
\]

and

\[
\text{Var}\left(\sum_{i=1}^{k} U_i\right) = k/12
\]

and by the central limit theorem,

\[
\frac{\sum_{i=1}^{k} U_i - k/2}{\sqrt{k/12}} \doteq Z
\]

The approximation improves as \(k\) increases, but a particularly convenient (and sufficiently large) value is \(k = 12\). The formula for generating a standard normal observation, then, reduces to

\[
Z = \sum_{i=1}^{12} U_i - 6
\]

Once a set of \(Z_i\)’s has been calculated, random observations from any normal distribution can be easily produced. Suppose the objective is to generate a set of \(Y_i\)’s that would be a random sample from a normal distribution having mean \(\mu\) and variance \(\sigma^2\). Since

\[
\frac{Y - \mu}{\sigma} = Z
\]

or, equivalently,

\[
Y = \mu + \sigma Z
\]

it follows that the random sample from \(f_Y(y)\) would be

\[
Y_i = \mu + \sigma Z_i, \quad i = 1, 2, \ldots
\]

By way of contrast, all that is needed to generate random observations from the exponential pdf, \(f_Y(y) = \lambda e^{-\lambda y}\), \(y \geq 0\), is a simple transformation. If \(U_i\), \(i = 1, 2, \ldots\), is a set of uniform random variables as defined earlier, then \(Y_i = -(1/\lambda)\ln U_i\), \(i = 1, 2, \ldots\), will be the desired set of exponential observations. Why that should be so is an exercise in differentiating the cdf of \(Y\). By definition,

\[
F_Y(y) = P(Y \leq y) = P(\ln U > -\lambda y) = P(U > e^{-\lambda y}) = \int_{e^{-\lambda y}}^{1} 1\,du = 1 - e^{-\lambda y}
\]

which implies that

\[
f_Y(y) = F_Y'(y) = \lambda e^{-\lambda y}, \quad y \geq 0
\]
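The transformation \(Y = -(1/\lambda)\ln U\) can be tried out numerically. The Python sketch below is an illustration of ours (the function name, seed, and choice \(\lambda = 0.75\) are assumptions made for the demonstration); it compares the sample mean with \(1/\lambda\) and the empirical cdf at \(y = 1\) with \(1 - e^{-\lambda}\).

```python
import math
import random

def exponential_from_uniform(lam, rng):
    """Inverse-transform draw: Y = -(1/lam) * ln(U), with U uniform on (0, 1]."""
    u = 1.0 - rng.random()  # random() lies in [0, 1); shift to (0, 1] so log is defined
    return -math.log(u) / lam

rng = random.Random(45)  # arbitrary fixed seed
lam = 0.75
ys = [exponential_from_uniform(lam, rng) for _ in range(50000)]

sample_mean = sum(ys) / len(ys)
empirical_cdf_at_1 = sum(y <= 1.0 for y in ys) / len(ys)
theoretical_cdf_at_1 = 1.0 - math.exp(-lam)
print(round(sample_mean, 3), round(empirical_cdf_at_1, 3), round(theoretical_cdf_at_1, 3))
```

The same idea works for any continuous pdf whose cdf can be inverted in closed form.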


Questions 


4.5.1. A door-to-door encyclopedia salesperson is 
required to document five in-home visits each day. Sup- 
pose that she has a 30% chance of being invited into any 
given home, with each address representing an indepen- 
dent trial. What is the probability that she requires fewer 
than eight houses to achieve her fifth success? 


4.5.2. An underground military installation is fortified to 
the extent that it can withstand up to three direct hits 
from air-to-surface missiles and still function. Suppose an 
enemy aircraft is armed with missiles, each having a 30% 
chance of scoring a direct hit. What is the probability that 
the installation will be destroyed with the seventh missile 
fired? 


4.5.3. Darryl’s statistics homework last night was to flip 
a fair coin and record the toss, X, when heads appeared 
for the second time. The experiment was to be repeated 
a total of one hundred times. The following are the one 
hundred values for X that Darryl turned in this morn- 
ing. Do you think that he actually did the assignment? 
Explain. 


SF. Be 2D) Oh Bu AB 3D, 
7 <3. 8-4 Bo 3 33.4 3° 3 
4.3.2 2 4 $2, 2.2. 4 
25 642 62 8 3 2 
8 23 2 4 3 2 6 3 3 
3.2 5 3 6 4 5 6 5 6 
3.5 2 7 2 10 4 3 2 2 
4245 5 5 62 4 3 
34 4--0. 53) 4° 2. 45°-3. 2 
5 7:35 3.2 7 4-4 4 °3 


4.5.4. When a machine is improperly adjusted, it has probability 0.15 of producing a defective item. Each day, the machine is run until three defective items are produced. When this occurs, it is stopped and checked for adjustment. What is the probability that an improperly adjusted machine will produce five or more items before being stopped? What is the average number of items an improperly adjusted machine will produce before being stopped?


4.5.5. For a negative binomial random variable whose pdf is given by Equation 4.5.1, find \(E(X)\) directly by evaluating \(\sum_{k=r}^{\infty} k\binom{k-1}{r-1}p^{r}(1-p)^{k-r}\). (Hint: Reduce the sum to one involving negative binomial probabilities with parameters \(r+1\) and \(p\).)


4.5.6. Let the random variable \(X\) denote the number of trials in excess of \(r\) that are required to achieve the \(r\)th success in a series of independent trials, where \(p\) is the probability of success at any given trial. Show that

\[
p_X(k) = \binom{k+r-1}{k}p^{r}(1-p)^{k}, \quad k = 0, 1, 2, \ldots
\]

[Note: This particular formula for \(p_X(k)\) is often used in place of Equation 4.5.1 as the definition of the pdf for a negative binomial random variable.]


4.5.7. Calculate the mean, variance, and moment-generating function for a negative binomial random variable \(X\) whose pdf is given by the expression

\[
p_X(k) = \binom{k+r-1}{k}p^{r}(1-p)^{k}, \quad k = 0, 1, 2, \ldots
\]

(See Question 4.5.6.)


4.5.8. Let X,, X2, and X; be three independent negative 
binomial random variables with pdfs 


me tae ie cle k=3,4,5 
ps t= ( 2 G) (=) , Dp P55, Se 


for \(i = 1, 2, 3\). Define \(X = X_1 + X_2 + X_3\). Find \(P(10 \leq X \leq 12)\). (Hint: Use the moment-generating functions of \(X_1\), \(X_2\), and \(X_3\) to deduce the pdf of \(X\).)


4.5.9. Differentiate the moment-generating function

\[
M_X(t) = \left[\frac{pe^t}{1-(1-p)e^t}\right]^r
\]

to verify the formula given in Theorem 4.5.1 for \(E(X)\).


4.5.10. Suppose that \(X_1, X_2, \ldots, X_k\) are independent negative binomial random variables with parameters \(r_1\) and \(p\), \(r_2\) and \(p\), ..., and \(r_k\) and \(p\), respectively. Let \(X = X_1 + X_2 + \cdots + X_k\). Find \(M_X(t)\), \(p_X(t)\), \(E(X)\), and \(\text{Var}(X)\).


4.6 The Gamma Distribution


Suppose a series of independent events are occurring at the constant rate of \(\lambda\) per unit time. If the random variable \(Y\) denotes the interval between consecutive occurrences, we know from Theorem 4.2.3 that \(f_Y(y) = \lambda e^{-\lambda y}\), \(y > 0\). Equivalently, \(Y\) can be interpreted as the “waiting time” for the first occurrence. This section generalizes the Poisson/exponential relationship and focuses on the interval, or waiting time, required for the \(r\)th event to occur (see Figure 4.6.1).

Figure 4.6.1: Time axis starting at 0 and marking the first, second, ..., and \(r\)th successes; \(Y\) is the waiting time from 0 to the \(r\)th success.

Theorem 4.6.1 Suppose that Poisson events are occurring at the constant rate of \(\lambda\) per unit time. Let the random variable \(Y\) denote the waiting time for the \(r\)th event. Then \(Y\) has pdf \(f_Y(y)\), where

\[
f_Y(y) = \frac{\lambda^r}{(r-1)!}y^{r-1}e^{-\lambda y}, \quad y \geq 0
\]


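Proof We will establish the formula for \(f_Y(y)\) by deriving and differentiating its cdf, \(F_Y(y)\). Let \(Y\) denote the waiting time to the \(r\)th occurrence. Then

\[
\begin{aligned}
F_Y(y) = P(Y \leq y) &= 1 - P(Y > y) \\
&= 1 - P(\text{Fewer than $r$ events occur in } [0, y]) \\
&= 1 - \sum_{k=0}^{r-1} e^{-\lambda y}\frac{(\lambda y)^k}{k!}
\end{aligned}
\]

since the number of events that occur in the interval \([0, y]\) is a Poisson random variable with parameter \(\lambda y\).

From Theorem 3.4.1,

\[
\begin{aligned}
f_Y(y) = F_Y'(y) &= \frac{d}{dy}\left[1 - \sum_{k=0}^{r-1} e^{-\lambda y}\frac{(\lambda y)^k}{k!}\right] \\
&= \sum_{k=0}^{r-1} \lambda e^{-\lambda y}\frac{(\lambda y)^k}{k!} - \sum_{k=1}^{r-1} \lambda e^{-\lambda y}\frac{(\lambda y)^{k-1}}{(k-1)!} \\
&= \sum_{k=0}^{r-1} \lambda e^{-\lambda y}\frac{(\lambda y)^k}{k!} - \sum_{k=0}^{r-2} \lambda e^{-\lambda y}\frac{(\lambda y)^k}{k!} \\
&= \frac{\lambda^r}{(r-1)!}y^{r-1}e^{-\lambda y}, \quad y \geq 0
\end{aligned}
\]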

Example 4.6.1

Engineers designing the next generation of space shuttles plan to include two fuel pumps: one active, the other in reserve. If the primary pump malfunctions, the second will automatically be brought on line.

Suppose a typical mission is expected to require that fuel be pumped for at most fifty hours. According to the manufacturer’s specifications, pumps are expected to fail once every one hundred hours (so \(\lambda = 0.01\)). What are the chances that such a fuel pump system would not remain functioning for the full fifty hours?

Let the random variable \(Y\) denote the time that will elapse before the second pump breaks down. According to Theorem 4.6.1, the pdf for \(Y\) has parameters \(r = 2\) and \(\lambda = 0.01\), and we can write

\[
f_Y(y) = \frac{(0.01)^2}{1!}ye^{-0.01y}, \quad y \geq 0
\]

Therefore,

\[
P(\text{System fails to last for fifty hours}) = \int_0^{50} 0.0001ye^{-0.01y}\,dy = \int_0^{0.50} ue^{-u}\,du
\]

where \(u = 0.01y\). The probability, then, that the primary pump and its backup would not remain operable for the targeted fifty hours is 0.09:

\[
\int_0^{0.50} ue^{-u}\,du = (-u-1)e^{-u}\Big|_{u=0}^{0.50} = 0.09
\]
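For integer \(r\), the probability in Example 4.6.1 can also be obtained without integrating, by using the cdf that appears in the proof of Theorem 4.6.1. The Python sketch below is our own illustration (the function name `waiting_time_cdf` is hypothetical).

```python
import math

def waiting_time_cdf(y, r, lam):
    """P(Y <= y) for the waiting time to the rth Poisson event (integer r).

    Uses F_Y(y) = 1 - sum_{k=0}^{r-1} exp(-lam*y) (lam*y)^k / k!,
    the cdf derived in the proof of Theorem 4.6.1.
    """
    x = lam * y
    poisson_terms = sum(math.exp(-x) * x**k / math.factorial(k) for k in range(r))
    return 1.0 - poisson_terms

p_fail = waiting_time_cdf(50.0, r=2, lam=0.01)
print(round(p_fail, 4))
```

Raising `r` to 3 in the call shows how much a second reserve pump would reduce the chance of failure.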


Generalizing the Waiting Time Distribution 


By virtue of Theorem 4.6.1, \(\int_0^\infty y^{r-1}e^{-y}\,dy\) converges for any integer \(r > 0\). But the convergence also holds for any real number \(r > 0\), because for any such \(r\) there will be an integer \(t > r\) and \(\int_0^\infty y^{r-1}e^{-y}\,dy \leq \int_0^\infty y^{t-1}e^{-y}\,dy < \infty\). The finiteness of \(\int_0^\infty y^{r-1}e^{-y}\,dy\) justifies the consideration of a related definite integral, one that was first studied by Euler, but named by Legendre.


Definition 4.6.1. For any real number \(r > 0\), the gamma function of \(r\) is denoted \(\Gamma(r)\), where

\[
\Gamma(r) = \int_0^\infty y^{r-1}e^{-y}\,dy
\]
0 


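Theorem 4.6.2 Let \(\Gamma(r) = \int_0^\infty y^{r-1}e^{-y}\,dy\) for any real number \(r > 0\). Then

1. \(\Gamma(1) = 1\)
2. \(\Gamma(r) = (r-1)\Gamma(r-1)\)
3. If \(r\) is an integer, then \(\Gamma(r) = (r-1)!\)

Proof

1. \(\Gamma(1) = \int_0^\infty y^{1-1}e^{-y}\,dy = \int_0^\infty e^{-y}\,dy = 1\)
2. Integrate the gamma function by parts. Let \(u = y^{r-1}\) and \(dv = e^{-y}\,dy\). Then

\[
\begin{aligned}
\int_0^\infty y^{r-1}e^{-y}\,dy &= -y^{r-1}e^{-y}\Big|_0^\infty + \int_0^\infty (r-1)y^{r-2}e^{-y}\,dy \\
&= (r-1)\int_0^\infty y^{r-2}e^{-y}\,dy = (r-1)\Gamma(r-1)
\end{aligned}
\]

3. Use part (2) as the basis for an induction argument. The details will be left as an exercise.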
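The three properties of the gamma function can be spot-checked numerically. The Python sketch below is our own illustration: `gamma_by_quadrature` is a hypothetical helper that approximates the defining integral with a midpoint rule on a truncated range, and the results are compared with the library function `math.gamma`.

```python
import math

def gamma_by_quadrature(r, upper=50.0, steps=100000):
    """Midpoint-rule approximation of the integral of y^(r-1) e^(-y) over [0, upper].

    The range is truncated at `upper`; the neglected tail is negligible
    for the small values of r used here.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        total += y ** (r - 1) * math.exp(-y)
    return total * h

for r in (1.0, 2.5, 3.5, 5.0):
    print(r, round(gamma_by_quadrature(r), 6), round(math.gamma(r), 6))
```

The printed pairs agree to the displayed precision, and the values for 2.5 and 3.5 illustrate property 2 with a non-integer argument.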


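Definition 4.6.2. Given real numbers \(r > 0\) and \(\lambda > 0\), the random variable \(Y\) is said to have the gamma pdf with parameters \(r\) and \(\lambda\) if

\[
f_Y(y) = \frac{\lambda^r}{\Gamma(r)}y^{r-1}e^{-\lambda y}, \quad y \geq 0
\]

Comment To justify Definition 4.6.2 requires a proof that \(f_Y(y)\) integrates to 1. Let \(u = \lambda y\). Then

\[
\int_0^\infty \frac{\lambda^r}{\Gamma(r)}y^{r-1}e^{-\lambda y}\,dy = \frac{\lambda^r}{\Gamma(r)}\int_0^\infty \left(\frac{u}{\lambda}\right)^{r-1}e^{-u}\,\frac{1}{\lambda}\,du = \frac{1}{\Gamma(r)}\int_0^\infty u^{r-1}e^{-u}\,du = \frac{1}{\Gamma(r)}\Gamma(r) = 1
\]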


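Theorem 4.6.3 Suppose that \(Y\) has a gamma pdf with parameters \(r\) and \(\lambda\). Then

1. \(E(Y) = r/\lambda\)
2. \(\text{Var}(Y) = r/\lambda^2\)

Proof

1.
\[
\begin{aligned}
E(Y) &= \int_0^\infty y\,\frac{\lambda^r}{\Gamma(r)}y^{r-1}e^{-\lambda y}\,dy = \frac{\lambda^r}{\Gamma(r)}\int_0^\infty y^{r}e^{-\lambda y}\,dy \\
&= \frac{\lambda^r\,\Gamma(r+1)}{\Gamma(r)\,\lambda^{r+1}}\int_0^\infty \frac{\lambda^{r+1}}{\Gamma(r+1)}y^{r}e^{-\lambda y}\,dy \\
&= \frac{\lambda^r\,r\,\Gamma(r)}{\Gamma(r)\,\lambda^{r+1}}(1) = \frac{r}{\lambda}
\end{aligned}
\]

2. A calculation similar to the integration carried out in part (1) shows that \(E(Y^2) = r(r+1)/\lambda^2\). Then

\[
\begin{aligned}
\text{Var}(Y) &= E(Y^2) - [E(Y)]^2 \\
&= r(r+1)/\lambda^2 - (r/\lambda)^2 \\
&= r/\lambda^2
\end{aligned}
\]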
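Part (2) of the proof quotes \(E(Y^2)\) without showing the work. One way to fill in the calculation, following the pattern of part (1) and using \(\Gamma(r+2) = (r+1)\,r\,\Gamma(r)\) from Theorem 4.6.2, is sketched below. This is our reconstruction of the omitted step, not a passage from the text.

```latex
\begin{aligned}
E(Y^2) &= \int_0^\infty y^2\,\frac{\lambda^r}{\Gamma(r)}\,y^{r-1}e^{-\lambda y}\,dy
        = \frac{\lambda^r}{\Gamma(r)}\int_0^\infty y^{r+1}e^{-\lambda y}\,dy \\
       &= \frac{\lambda^r\,\Gamma(r+2)}{\Gamma(r)\,\lambda^{r+2}}
          \int_0^\infty \frac{\lambda^{r+2}}{\Gamma(r+2)}\,y^{r+1}e^{-\lambda y}\,dy \\
       &= \frac{(r+1)\,r\,\Gamma(r)}{\Gamma(r)\,\lambda^{2}}\,(1)
        = \frac{r(r+1)}{\lambda^2}
\end{aligned}
```

The integral in the second line equals 1 because its integrand is the gamma pdf with parameters \(r+2\) and \(\lambda\).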


Sums of Gamma Random Variables


We have already seen that certain random variables satisfy an additive property that “reproduces” the pdf: the sum of two independent binomial random variables with the same \(p\), for example, is binomial (recall Example 3.8.2). Similarly, the sum of two independent Poissons is Poisson and the sum of two independent normals is normal. That said, most random variables are not additive. The sum of two independent uniforms is not uniform; the sum of two independent exponentials is not exponential; and so on. Gamma random variables belong to the short list making up the first category.


Theorem 4.6.4 Suppose \(U\) has the gamma pdf with parameters \(r\) and \(\lambda\), \(V\) has the gamma pdf with parameters \(s\) and \(\lambda\), and \(U\) and \(V\) are independent. Then \(U + V\) has a gamma pdf with parameters \(r + s\) and \(\lambda\).


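Proof The pdf of the sum is the convolution integral

\[
\begin{aligned}
f_{U+V}(t) &= \int_{-\infty}^{\infty} f_U(u)f_V(t-u)\,du \\
&= \int_0^t \frac{\lambda^r}{\Gamma(r)}u^{r-1}e^{-\lambda u}\,\frac{\lambda^s}{\Gamma(s)}(t-u)^{s-1}e^{-\lambda(t-u)}\,du \\
&= e^{-\lambda t}\frac{\lambda^{r+s}}{\Gamma(r)\Gamma(s)}\int_0^t u^{r-1}(t-u)^{s-1}\,du
\end{aligned}
\]

Make the substitution \(v = u/t\). Then the integral becomes

\[
t^{r-1}t^{s-1}t\int_0^1 v^{r-1}(1-v)^{s-1}\,dv = t^{r+s-1}\int_0^1 v^{r-1}(1-v)^{s-1}\,dv
\]

and

\[
f_{U+V}(t) = \lambda^{r+s}t^{r+s-1}e^{-\lambda t}\left[\frac{1}{\Gamma(r)\Gamma(s)}\int_0^1 v^{r-1}(1-v)^{s-1}\,dv\right]
\tag{4.6.1}
\]

The numerical value of the constant in brackets in Equation 4.6.1 is not immediately obvious, but the factors in front of the brackets correspond to the functional part of a gamma pdf with parameters \(r + s\) and \(\lambda\). It follows, then, that \(f_{U+V}(t)\) must be that particular gamma pdf. It also follows that the constant in brackets must equal \(1/\Gamma(r+s)\) (to comply with Definition 4.6.2), so, as a “bonus” identity, Equation 4.6.1 implies that

\[
\int_0^1 v^{r-1}(1-v)^{s-1}\,dv = \frac{\Gamma(r)\Gamma(s)}{\Gamma(r+s)}
\]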
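The “bonus” identity can be checked numerically for particular values of \(r\) and \(s\). The Python sketch below is our own illustration (`beta_integral` and `gamma_ratio` are hypothetical names); it approximates the integral with a midpoint rule and compares it with the ratio of gamma functions computed by `math.gamma`.

```python
import math

def beta_integral(r, s, steps=200000):
    """Midpoint-rule approximation of the integral of v^(r-1) (1-v)^(s-1) over [0, 1].

    Intended for r >= 1 and s >= 1, where the integrand stays bounded.
    """
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        v = (i + 0.5) * h
        total += v ** (r - 1) * (1.0 - v) ** (s - 1)
    return total * h

def gamma_ratio(r, s):
    """Right-hand side of the identity: Gamma(r) Gamma(s) / Gamma(r + s)."""
    return math.gamma(r) * math.gamma(s) / math.gamma(r + s)

for r, s in ((2.0, 3.0), (2.5, 1.5)):
    print(r, s, round(beta_integral(r, s), 6), round(gamma_ratio(r, s), 6))
```

For \(r = 2\) and \(s = 3\) both sides equal \(1!\,2!/4! = 1/12\), which gives a hand-checkable case.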


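Theorem 4.6.5 If \(Y\) has a gamma pdf with parameters \(r\) and \(\lambda\), then \(M_Y(t) = (1 - t/\lambda)^{-r}\).

Proof

\[
\begin{aligned}
M_Y(t) = E(e^{tY}) &= \int_0^\infty e^{ty}\frac{\lambda^r}{\Gamma(r)}y^{r-1}e^{-\lambda y}\,dy = \frac{\lambda^r}{\Gamma(r)}\int_0^\infty y^{r-1}e^{-(\lambda - t)y}\,dy \\
&= \frac{\lambda^r}{\Gamma(r)}\,\frac{\Gamma(r)}{(\lambda - t)^r}\int_0^\infty \frac{(\lambda - t)^r}{\Gamma(r)}y^{r-1}e^{-(\lambda - t)y}\,dy \\
&= \frac{\lambda^r}{(\lambda - t)^r}(1) = (1 - t/\lambda)^{-r}
\end{aligned}
\]

Questions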


4.6.1. An Arctic weather station has three electronic wind 
gauges. Only one is used at any given time. The life- 
time of each gauge is exponentially distributed with a 
mean of one thousand hours. What is the pdf of Y, the 
random variable measuring the time until the last gauge 
wears out? 


4.6.2. A service contract on a new university computer system provides twenty-four free repair calls from a technician. Suppose the technician is required, on the average, three times a month. What is the average time it will take for the service contract to be fulfilled?


4.6.3. Suppose a set of measurements \(Y_1, Y_2, \ldots, Y_{100}\) is taken from a gamma pdf for which \(E(Y) = 1.5\) and \(\text{Var}(Y) = 0.75\). How many \(Y_i\)’s would you expect to find in the interval [1.0, 2.5]?


4.6.4. Demonstrate that \(\lambda\) plays the role of a scale parameter by showing that if \(Y\) is gamma with parameters \(r\) and \(\lambda\), then \(\lambda Y\) is gamma with parameters \(r\) and 1.


4.6.5. Show that a gamma pdf has the unique mode \(\frac{r-1}{\lambda}\); that is, show that the function \(f_Y(y) = \frac{\lambda^r}{\Gamma(r)}y^{r-1}e^{-\lambda y}\) takes its maximum value at \(y_{\text{mode}} = \frac{r-1}{\lambda}\) and at no other point.


4.6.6. Prove that \(\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}\). [Hint: Consider \(E(Z^2)\), where \(Z\) is a standard normal random variable.]


4.6.7. Show that (2) = 2 /z. 
4.6.8. If the random variable \(Y\) has the gamma pdf with integer parameter \(r\) and arbitrary \(\lambda > 0\), show that

\[
E(Y^m) = \frac{(m+r-1)!}{(r-1)!\,\lambda^m}
\]

[Hint: Use the fact that \(\int_0^\infty y^{r-1}e^{-y}\,dy = (r-1)!\) when \(r\) is a positive integer.]


4.6.9. Differentiate the gamma moment-generating func- 
tion to verify the formulas for E(Y) and Var(Y) given in 
Theorem 4.6.3. 


4.6.10. Differentiate the gamma moment-generating function to show that the formula for \(E(Y^m)\) given in Question 4.6.8 holds for arbitrary \(r > 0\).


4.7 Taking a Second Look at Statistics (Monte Carlo Simulations)


Calculating probabilities associated with (1) single random variables and (2) func- 
tions of sets of random variables has been the overarching theme of Chapters 3 
and 4. Facilitating those computations has been a variety of transformations, sum- 
mation properties, and mathematical relationships linking one pdf with another. 
Collectively, these results are enormously effective. Sometimes, though, the intrinsic 
complexity of a random variable overwhelms our ability to model its probabilis- 
tic behavior in any formal or precise way. An alternative in those situations is 
to use a computer to draw random samples from one or more distributions that 
model portions of the random variable’s behavior. If a large enough number of 
such samples is generated, a histogram (or density-scaled histogram) can be con- 
structed that will accurately reflect the random variable’s true (but unknown) 
distribution. Sampling “experiments” of this sort are known as Monte Carlo studies.


Real-life situations where a Monte Carlo analysis could be helpful are not 
difficult to imagine. Suppose, for instance, you just bought a state-of-the-art, high- 
definition, plasma screen television. In addition to the pricey initial cost, an optional 
warranty is available that covers all repairs made during the first two years. Accord- 
ing to an independent laboratory’s reliability study, this particular television is 
likely to require 0.75 service call per year, on the average. Moreover, the costs of service calls are expected to be normally distributed with a mean (\(\mu\)) of $100 and a standard deviation (\(\sigma\)) of $20. If the warranty sells for $200, should you buy it?

Like any insurance policy, a warranty may or may not be a good investment, depending on what events unfold, and when. Here the relevant random variable is \(W\), the total amount spent on repair calls during the first two years. For any particular customer, the value of \(W\) will depend on (1) the number of repairs needed in the first two years and (2) the cost of each repair. Although we have reliability and cost assumptions that address (1) and (2), the two-year limit on the warranty introduces a complexity that goes beyond what we have learned in Chapters 3 and 4. What remains is the option of using random samples to simulate the repair costs that might accrue during those first two years.

Note, first, that it would not be unreasonable to assume that the service calls are Poisson events (occurring at the rate of 0.75 per year). If that were the case, Theorem 4.2.3 implies that the interval, \(Y\), between successive repair calls would have an exponential distribution with pdf

\[
f_Y(y) = 0.75e^{-0.75y}, \quad y > 0
\]

(see Figure 4.7.1). Moreover, if the random variable \(C\) denotes the cost associated with a particular maintenance call, then,

\[
f_C(c) = \frac{1}{\sqrt{2\pi}\,(20)}e^{-\frac{1}{2}[(c-100)/20]^2}, \quad -\infty < c < \infty
\]

(see Figure 4.7.2).

Figure 4.7.1: The exponential pdf \(f_Y(y)\).

Figure 4.7.2: The normal pdf \(f_C(c)\) (horizontal axis marked at 40, 100, and 160).


Now, with the pdfs for \(Y\) and \(C\) fully specified, we can use the computer to generate representative repair cost scenarios. We begin by generating a random sample (of size 1) from the pdf, \(f_Y(y) = 0.75e^{-0.75y}\). Either of two equivalent Minitab procedures can be followed:

Session Window Method

Click on EDITOR, then on ENABLE COMMANDS (this activates the Session Window). Then type

    MTB > random 1 c1;
    SUBC > exponential 1.33.
    MTB > print c1

    Data Display

    c1
    1.15988

OR

Menu-Driven Method

Click on CALC, then on RANDOM DATA, then on EXPONENTIAL. Type 1 in “Number of rows” box; type 1.33 in “Scale” box; type c1 in “Store” box. Click on OK. The generated exponential deviate appears in the upper left hand corner of the WORKSHEET.

(Note: For both methods, Minitab uses \(1/\lambda\) as the exponential parameter. Here, \(1/\lambda = 1/0.75 = 1.33\).)

As shown in Figure 4.7.3, the number generated was 1.15988 yrs (corresponding to a first repair call occurring 423 days (= 1.15988 × 365) after the purchase of the TV).

Figure 4.7.3: The exponential pdf \(f_Y(y)\), with the generated observation \(y = 1.15988\) marked on the horizontal axis.

Applying the same syntax a second time yielded the random sample 0.284931 year (= 104 days); applying it still a third time produced the observation 1.46394 years (= 534 days). These last two observations taken on \(f_Y(y)\) correspond to the second repair call occurring 104 days after the first, and the third occurring 534 days after the second (see Figure 4.7.4). Since the warranty does not extend past the first 730 days, the third repair would not be covered.

Figure 4.7.4: Timeline of days after purchase, with the warranty ending at day 730. The 1st breakdown (\(y = 1.15988\)) occurs after 423 days, repair cost = $127.20; the 2nd breakdown (\(y = 0.284931\)) occurs 104 days later, repair cost = $98.67; the 3rd breakdown (\(y = 1.46394\)) occurs 534 days after the second, so its repair cost is not covered.


The next step in the simulation would be to generate two observations from \(f_C(c)\) that would model the costs of the two repairs that occurred during the warranty period. The session-window syntax for simulating each repair cost would be the statements

    MTB > random 1 c1;
    SUBC > normal 100 20.
    MTB > print c1

Figure 4.7.5 (Minitab session; each exponential draw is an observation from \(f_Y(y)\) and each normal draw is an observation from \(f_C(c)\)):

    MTB > random 1 c1;
    SUBC > exponential 1.33.
    MTB > print c1
    c1
    1.15988

    MTB > random 1 c1;
    SUBC > normal 100 20.
    MTB > print c1
    c1
    127.199

    MTB > random 1 c1;
    SUBC > exponential 1.33.
    MTB > print c1
    c1
    0.284931

    MTB > random 1 c1;
    SUBC > normal 100 20.
    MTB > print c1
    c1
    98.6673

    MTB > random 1 c1;
    SUBC > exponential 1.33.
    MTB > print c1
    c1
    1.46394


Running those commands twice produced c-values of 127.199 and 98.6673 (see 
Figure 4.7.5), corresponding to repair bills of $127.20 and $98.67, meaning that a 
total of $225.87 (= $127.20 + $98.67) would have been spent on maintenance dur- 
ing the first two years. In that case, the $200 warranty would have been a good 
investment. 

The final step in the Monte Carlo analysis is to repeat many times the sampling process that led to Figure 4.7.5: that is, to generate a series of $y_i$'s whose sum (in days) is less than or equal to 730, and for each $y_i$ in that sample, to generate a corresponding cost, $c_i$. The sum of those $c_i$'s becomes a simulated value of the maintenance-cost random variable, $W$.
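
The repeated sampling just described can be sketched in Python. This is only an illustration, not the Minitab procedure itself: it assumes that the value 1.33 in the `exponential` subcommand is the mean time (in years) between breakdowns, that repair costs are normal with mean 100 and standard deviation 20, and that the warranty lasts two years. The function name `simulate_repair_cost` and the seed are hypothetical choices.

```python
import random

def simulate_repair_cost(rng, mean_gap_years=1.33, cost_mean=100.0,
                         cost_sd=20.0, warranty_years=2.0):
    """Return one simulated value of W, the total repair cost under warranty."""
    elapsed = 0.0
    total_cost = 0.0
    while True:
        # Time until the next breakdown; expovariate takes the rate (1 / mean).
        elapsed += rng.expovariate(1.0 / mean_gap_years)
        if elapsed > warranty_years:
            return total_cost          # this breakdown falls outside the warranty
        total_cost += rng.gauss(cost_mean, cost_sd)

rng = random.Random(2012)              # fixed seed so the run is reproducible
costs = sorted(simulate_repair_cost(rng) for _ in range(100))

mean_cost = sum(costs) / len(costs)
median_cost = (costs[49] + costs[50]) / 2
share_over_200 = sum(1 for w in costs if w > 200) / len(costs)
share_zero = sum(1 for w in costs if w == 0.0) / len(costs)
print(mean_cost, median_cost, share_over_200, share_zero)
```

Because the draws are random, the summary values from any one run of one hundred simulated periods will differ somewhat from run to run.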
Figure 4.7.6

[Histogram of simulated repair costs: frequency (vertical axis) against cost from $0 to $600 (horizontal axis), with the warranty cost ($200), the median (m = $117.00), and the mean ($159.10) marked.]


The histogram in Figure 4.7.6 shows the distribution of repair costs incurred in 
one hundred simulated two-year periods, one being the sequence of events detailed 
in Figure 4.7.5. There is much that it tells us. First of all (and not surprisingly), the 
warranty costs more than either the median repair bill (= $117.00) or the mean 
repair bill (= $159.10). 

The customer, in other words, will tend to lose money on the optional protec- 
tion, and the company will tend to make money. On the other hand, a full 33% of 
the simulated two-year breakdown scenarios led to repair bills in excess of $200, 
including 6% that were more than twice the cost of the warranty. At the other 
extreme, 24% of the samples produced no maintenance problems whatsoever; for 
those customers, the $200 spent up front is totally wasted! 

So, should you buy the warranty? Yes, if you feel the need to have a financial 
cushion to offset the (small) probability of experiencing exceptionally bad luck; no, 
if you can afford to absorb an occasional big loss. 


Appendix 4.A.1 Minitab Applications 


Figure 4.A.1.1 


Examples at the end of Chapter 3 and earlier in this chapter illustrated the use of 
Minitab’s PDF, CDF, and INVCDF commands on the binomial, exponential, and 
normal distributions. Altogether, those same commands can be applied to more than 
twenty of the probability distributions most frequently encountered, including the 
Poisson, geometric, negative binomial, and gamma pdfs featured in Chapter 4. 

Recall the leukemia cluster study described in Case Study 4.2.1. The data's interpretation hinged on the value of $P(X \ge 8)$, where $X$ was a Poisson random variable with pdf $p_X(k) = e^{-1.75}(1.75)^k / k!$, $k = 0, 1, 2, \ldots$. The printout in Figure 4.A.1.1 shows the calculation of $P(X \ge 8)$ using the CDF command and the fact that $P(X \ge 8) = 1 - P(X \le 7)$.


MTB > cdf 7;
SUBC > poisson 1.75.

Cumulative Distribution Function
Poisson with mean = 1.75
x    P(X <= x)
7    0.999532

MTB > let k1 = 1 - 0.999532
MTB > print k1

Data Display

k1    0.000468000
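
For readers without Minitab, the same tail probability can be checked with a few lines of Python. The helper `poisson_cdf` is a hypothetical name introduced here; it simply sums the Poisson pdf with mean 1.75 from 0 through 7.

```python
import math

def poisson_cdf(x, mean):
    """P(X <= x) for a Poisson random variable with the given mean."""
    return sum(math.exp(-mean) * mean**k / math.factorial(k)
               for k in range(x + 1))

# P(X >= 8) = 1 - P(X <= 7)
tail = 1 - poisson_cdf(7, 1.75)
print(round(tail, 6))
```

The result agrees with the value of k1 in the printout to the six decimal places shown there.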


Figure 4.A.1.2 


Figure 4.A.1.3 
Areas under normal curves between points $a$ and $b$ are calculated by subtracting $F_Y(a)$ from $F_Y(b)$, just as we did in Section 4.3 (recall the Comment after Definition 4.3.1). There is no need, however, to reexpress the probability as an area under the standard normal curve. Figure 4.A.1.2 shows the Minitab calculation for the probability that the random variable $Y$ lies between 48 and 51, where $Y$ is normally distributed with $\mu = 50$ and $\sigma = 4$. According to the computer,

\[
P(48 < Y < 51) = F_Y(51) - F_Y(48) = 0.598706 - 0.308538 = 0.290168
\]


MTB > cdf 51; 
SUBC> normal 50 4. 


Cumulative Distribution Function 

Normal with mean = 50 and standard deviation = 4 
x P( X <= x) 
51 0.598706 

MTB > cdf 48; 

SUBC> normal 50 4. 


Cumulative Distribution Function 

Normal with mean = 50.0000 and standard deviation = 4.00000 
x P( X <= x) 
48 0.308538 

MTB > let k1 = 0.598706 - 0.308538
MTB > print k1

Data Display

k1    0.290168
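
The same area can be reproduced in Python using the error function from the standard library. The helper `normal_cdf` is a hypothetical name; it relies on the identity $F_Y(y) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{y - \mu}{\sigma\sqrt{2}}\right)\right]$ for a normal random variable.

```python
import math

def normal_cdf(y, mu, sigma):
    """F_Y(y) for a normal random variable with mean mu and sd sigma."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

# P(48 < Y < 51) = F_Y(51) - F_Y(48) with mu = 50 and sigma = 4
prob = normal_cdf(51, 50, 4) - normal_cdf(48, 50, 4)
print(round(prob, 6))
```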


On several occasions in Chapter 4 we made use of Minitab's RANDOM command, a subroutine that generates samples from a specific pdf. Simulations of that sort can be very helpful in illustrating a variety of statistical concepts. Shown in Figure 4.A.1.3, for example, is the syntax for generating a random sample of size 50 from a binomial pdf having $n = 60$ and $p = 0.40$. And calculated for each of those fifty observations is its Z ratio, given by

\[
Z\text{-ratio} = \frac{X - E(X)}{\sqrt{\operatorname{Var}(X)}} = \frac{X - 60(0.40)}{\sqrt{60(0.40)(0.60)}} = \frac{X - 24}{\sqrt{14.4}}
\]

[By the DeMoivre-Laplace theorem, of course, the distribution of those ratios should have a shape much like the standard normal pdf, $f_Z(z)$.]


MTB > random 50 c1;
SUBC> binomial 60 0.40.
MTB > print c1

Data Display

C1
27 29 23 22 21 21 22 26 26 20 26 25 27
32 22 27 22 20 19 19 21 23 28 23 27 29
13 24 22 25 25 20 25 26 15 24 17 28 21
16 24 22 25 25 21 23 23 20 25 30

MTB > let c2 = (c1 - 24)/sqrt(14.4)
MTB > name c2 'Z-ratio'
MTB > print c2

Data Display

Z-ratio
 0.79057  1.31762 -0.26352 -0.52705 -0.79057 -0.79057 -0.52705
 0.52705  0.52705 -1.05409  0.52705  0.26352  0.79057  2.10819
-0.52705  0.79057 -0.52705 -1.05409 -1.31762 -1.31762 -0.79057
-0.26352  1.05409 -0.26352  0.79057  1.31762 -2.89875  0.00000
-0.52705  0.26352  0.26352 -1.05409  0.26352  0.52705 -2.37171
 0.00000 -1.84466  1.05409 -0.79057 -2.10819  0.00000 -0.52705
 0.26352  0.26352 -0.79057 -0.26352 -0.26352 -1.05409  0.26352
 1.58114
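
A rough Python counterpart of this session is sketched below. It is an assumption-laden illustration rather than a translation of Minitab's generator: each binomial observation is built by counting successes in 60 simulated Bernoulli trials, and the names `z_ratio` and `sample` are hypothetical.

```python
import math
import random

n_trials, p = 60, 0.40
mean = n_trials * p                        # E(X) = 24
sd = math.sqrt(n_trials * p * (1 - p))     # sqrt(Var(X)) = sqrt(14.4)

def z_ratio(x):
    """Standardize a binomial observation x."""
    return (x - mean) / sd

rng = random.Random(0)                     # fixed seed for reproducibility
# Each observation counts the successes in 60 independent trials.
sample = [sum(rng.random() < p for _ in range(n_trials)) for _ in range(50)]
z_values = [z_ratio(x) for x in sample]

print(sample)
print([round(z, 5) for z in z_values])
```

As a check on the formula, `z_ratio(27)` rounds to 0.79057, the first entry in the Z-ratio display.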
Appendix 4.A.2 A Proof of the Central Limit Theorem

Proving Theorem 4.3.2 in its full generality is beyond the level of this text. However, we can establish a slightly weaker version of the result by assuming that the moment-generating function of each $W_i$ exists.

Lemma

Let $W_1, W_2, \ldots$ be a set of random variables such that $\lim_{n \to \infty} M_{W_n}(t) = M_W(t)$ for all $t$ in some interval about 0. Then $\lim_{n \to \infty} F_{W_n}(w) = F_W(w)$ for all $-\infty < w < \infty$.


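To prove the central limit theorem using moment-generating functions requires showing that

\[
\lim_{n \to \infty} M_{(W_1 + \cdots + W_n - n\mu)/(\sqrt{n}\,\sigma)}(t) = M_Z(t) = e^{t^2/2}
\]

For notational simplicity, let

\[
\frac{W_1 + \cdots + W_n - n\mu}{\sqrt{n}\,\sigma} = \frac{S_1 + \cdots + S_n}{\sqrt{n}}
\]

where $S_i = (W_i - \mu)/\sigma$. Notice that $E(S_i) = 0$ and $\operatorname{Var}(S_i) = 1$. Moreover, from Theorem 3.12.3,

\[
M_{(S_1 + \cdots + S_n)/\sqrt{n}}(t) = \left[ M\!\left( \frac{t}{\sqrt{n}} \right) \right]^n
\]

where $M(t)$ denotes the moment-generating function common to each of the $S_i$'s.

By virtue of the way the $S_i$'s are defined, $M(0) = 1$, $M^{(1)}(0) = E(S_i) = 0$, and $M^{(2)}(0) = \operatorname{Var}(S_i) = 1$. Applying Taylor's theorem, then, to $M(t)$, we can write

\[
M(t) = 1 + M^{(1)}(0)\,t + \frac{1}{2} M^{(2)}(r)\,t^2 = 1 + \frac{1}{2} t^2 M^{(2)}(r)
\]

for some number $r$, $|r| < |t|$. Thus

\[
\begin{aligned}
\lim_{n \to \infty} \left[ M\!\left( \frac{t}{\sqrt{n}} \right) \right]^n
&= \lim_{n \to \infty} \left[ 1 + \frac{t^2}{2n} M^{(2)}(s) \right]^n, \qquad |s| < \frac{|t|}{\sqrt{n}} \\
&= \exp \lim_{n \to \infty} n \ln \left[ 1 + \frac{t^2}{2n} M^{(2)}(s) \right] \\
&= \exp \lim_{n \to \infty} \frac{t^2}{2} M^{(2)}(s) \cdot \frac{\ln \left[ 1 + \frac{t^2}{2n} M^{(2)}(s) \right] - \ln(1)}{\frac{t^2}{2n} M^{(2)}(s)}
\end{aligned}
\]

The existence of $M(t)$ implies the existence of all its derivatives. In particular, $M^{(3)}(t)$ exists, so $M^{(2)}(t)$ is continuous. Therefore, $\lim_{t \to 0} M^{(2)}(t) = M^{(2)}(0) = 1$. Since $|s| < |t|/\sqrt{n}$, $s \to 0$ as $n \to \infty$, so

\[
\lim_{n \to \infty} M^{(2)}(s) = M^{(2)}(0) = 1
\]

Also, as $n \to \infty$, the quantity $(t^2/2n) M^{(2)}(s) \to 0 \cdot 1 = 0$, so it plays the role of "$\Delta x$" in the definition of the derivative. Hence we obtain

\[
\lim_{n \to \infty} \left[ M\!\left( \frac{t}{\sqrt{n}} \right) \right]^n = \exp \left[ \frac{t^2}{2} \cdot 1 \cdot \ln^{(1)}(1) \right] = e^{(1/2)t^2}
\]

Since this last expression is the moment-generating function for a standard normal random variable, the theorem is proved.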
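
The limit at the heart of the proof can be checked numerically for one concrete case. The sketch below assumes, purely for illustration, that each $S_i$ equals $+1$ or $-1$ with probability 1/2, so that $E(S_i) = 0$, $\operatorname{Var}(S_i) = 1$, and $M(t) = \cosh t$; this particular distribution is not taken from the text.

```python
import math

def mgf_of_standardized_sum(t, n):
    """[M(t / sqrt(n))]^n when each S_i is +1 or -1 with probability 1/2."""
    return math.cosh(t / math.sqrt(n)) ** n

t = 1.0
target = math.exp(t * t / 2)     # M_Z(t) for a standard normal
for n in (1, 10, 100, 10_000, 1_000_000):
    print(n, mgf_of_standardized_sum(t, n), target)
```

The printed values approach $e^{t^2/2}$ as $n$ grows, which is the convergence the lemma converts into convergence of the cdfs.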
A towering figure in the development of both applied and mathematical statistics,
Fisher had formal training in mathematics and theoretical physics, graduating from 
Cambridge in 1912. After a brief career as a teacher, he accepted a post in 1919 as 
Statistician at the Rothamsted Experimental Station. There, the day-to-day problems 
he encountered in collecting and interpreting agricultural data led directly to much of 
his most important work in the theory of estimation and experimental design. Fisher 
was also a prominent geneticist and devoted considerable time to the development of 
a quantitative argument that would support Darwin’s theory of natural selection. 
He returned to academia in 1933, succeeding Karl Pearson as the Galton Professor 
of Eugenics at the University of London. Fisher was knighted in 1952. 

—Ronald Aylmer Fisher (1890-1962) 


5.1 Introduction 


The ability of probability functions to describe, or model, experimental data was 
demonstrated in numerous examples in Chapter 4. In Section 4.2, for example, 
the Poisson distribution was shown to predict very well the number of alpha emis- 
sions from a radioactive source as well as the number of wars starting in a given 
year. In Section 4.3 another probability model, the normal curve, was applied to 
phenomena as diverse as breath analyzer readings and IQ scores. Other models 
illustrated in Chapter 4 included the exponential, negative binomial, and gamma 
distributions. 

All of these probability functions, of course, are actually families of models in the sense that each includes one or more parameters. The Poisson model, for instance, is indexed by the occurrence rate, $\lambda$. Changing $\lambda$ changes the probabilities associated with $p_X(k)$ [see Figure 5.1.1, which compares $p_X(k) = e^{-\lambda}\lambda^k/k!$, $k = 0, 1, 2, \ldots$, for $\lambda = 1$ and $\lambda = 4$]. Similarly, the binomial model is defined in terms of the success probability $p$; the normal distribution, by the two parameters $\mu$ and $\sigma$.
Before any of these models can be applied, values need to be assigned to their
parameters. Typically, this is done by taking a random sample (of n observations) 
and using those measurements to estimate the unknown parameter(s). 


Figure 5.1.1

[Two graphs of $p_X(k)$ against $k$: the Poisson pdf with $\lambda = 1$ (for $k$ from 0 to 8) and the Poisson pdf with $\lambda = 4$ (for $k$ from 0 to 12).]

Example 5.1.1

Imagine being handed a coin whose probability, $p$, of coming up heads is unknown. Your assignment is to toss the coin three times and use the resulting sequence of H's and T's to suggest a value for $p$. Suppose the sequence of three tosses turns out to be HHT. Based on those outcomes, what can be reasonably inferred about $p$?

Start by defining the random variable $X$ to be the number of heads on a given toss. Then

\[
X = \begin{cases} 1 & \text{if a toss comes up heads} \\ 0 & \text{if a toss comes up tails} \end{cases}
\]

and the theoretical probability model for $X$ is the function

\[
p_X(k) = p^k (1 - p)^{1-k} = \begin{cases} p & \text{for } k = 1 \\ 1 - p & \text{for } k = 0 \end{cases}
\]

Expressed in terms of $X$, the sequence HHT corresponds to a sample of size $n = 3$, where $X_1 = 1$, $X_2 = 1$, and $X_3 = 0$.

Since the $X_i$'s are independent random variables, the probability associated with the sample is $p^2(1 - p)$:

\[
P(X_1 = 1 \cap X_2 = 1 \cap X_3 = 0) = P(X_1 = 1) \cdot P(X_2 = 1) \cdot P(X_3 = 0) = p^2(1 - p)
\]

Knowing that our objective is to identify a plausible value (i.e., an "estimate") for $p$, it could be argued that a reasonable choice for that parameter would be the value that maximizes the probability of the sample. Figure 5.1.2 shows $P(X_1 = 1, X_2 = 1, X_3 = 0)$ as a function of $p$. By inspection, we see that the value that maximizes the probability of HHT is $p = \frac{2}{3}$.

More generally, suppose we toss the coin $n$ times and record a set of outcomes $X_1 = k_1, X_2 = k_2, \ldots$, and $X_n = k_n$. Then

\[
P(X_1 = k_1, \ldots, X_n = k_n) = p^{k_1}(1 - p)^{1 - k_1} \cdots p^{k_n}(1 - p)^{1 - k_n} = p^{\sum_{i=1}^{n} k_i} (1 - p)^{n - \sum_{i=1}^{n} k_i}
\]
[Graph of $p^2(1 - p)$ against $p$ for $0 \le p \le 1$, with the maximum marked at $p = \frac{2}{3}$.]

Figure 5.1.2

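The value of $p$ that maximizes $P(X_1 = k_1, \ldots, X_n = k_n)$ is, of course, the value for which the derivative of $p^{\sum_{i=1}^{n} k_i}(1 - p)^{n - \sum_{i=1}^{n} k_i}$ with respect to $p$ is 0. But

\[
\frac{d}{dp}\left[ p^{\sum k_i}(1 - p)^{n - \sum k_i} \right] = \left( \sum_{i=1}^{n} k_i \right) p^{\sum k_i - 1}(1 - p)^{n - \sum k_i} + \left( \sum_{i=1}^{n} k_i - n \right) p^{\sum k_i}(1 - p)^{n - \sum k_i - 1} \qquad (5.1.1)
\]

If the derivative is set equal to zero, Equation 5.1.1 reduces to

\[
\left( \sum_{i=1}^{n} k_i \right)(1 - p) + \left( \sum_{i=1}^{n} k_i - n \right) p = 0
\]

Solving for $p$ identifies

\[
\left( \frac{1}{n} \right) \sum_{i=1}^{n} k_i
\]

as the value of the parameter that is most consistent with the $n$ observations $k_1, k_2, \ldots, k_n$.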
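
As a quick numerical check of this result, the following Python sketch evaluates the probability of the sample HHT over a grid of candidate values of $p$ and compares the maximizing grid point with the formula $(1/n)\sum k_i$. The grid spacing of 0.001 and the function name `likelihood` are arbitrary choices made for the illustration.

```python
def likelihood(p, outcomes):
    """Probability of the observed 0/1 outcomes when P(heads) = p."""
    heads = sum(outcomes)
    tails = len(outcomes) - heads
    return p**heads * (1 - p)**tails

outcomes = [1, 1, 0]                          # the sequence HHT
grid = [i / 1000 for i in range(1001)]        # candidate values 0, 0.001, ..., 1
p_grid = max(grid, key=lambda p: likelihood(p, outcomes))
p_formula = sum(outcomes) / len(outcomes)     # (1/n) * sum of the k_i

print(p_grid, p_formula)
```

The grid maximum lands on the grid point closest to 2/3, in agreement with the calculus.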


Comment Any function of a random sample whose objective is to approximate a parameter is called a statistic, or an estimator. If $\theta$ is the parameter being approximated, its estimator will be denoted $\hat{\theta}$. When an estimator is evaluated (by substituting the actual measurements recorded), the resulting number is called an estimate. In Example 5.1.1, the function $\left(\frac{1}{n}\right)\sum_{i=1}^{n} X_i$ is an estimator for $p$; the value $\frac{2}{3}$ that is calculated when the $n = 3$ observations are $X_1 = 1$, $X_2 = 1$, and $X_3 = 0$ is an estimate of $p$. More specifically, $\left(\frac{1}{n}\right)\sum_{i=1}^{n} X_i$ is a maximum likelihood estimator (for $p$) and $\frac{2}{3}$ $\left[= \left(\frac{1}{n}\right)\sum_{i=1}^{n} k_i = \left(\frac{1}{3}\right)(2)\right]$ is a maximum likelihood estimate (for $p$).


In this chapter, we look at some of the practical, as well as the mathematical, 
issues involved in the problem of estimating parameters. How is the functional 
form of an estimator determined? What statistical properties does a given estimator have? What properties would we like an estimator to have? As we answer these
questions, our focus will begin to shift away from the study of probability and toward 
the study of statistics. 
5.2 Estimating Parameters: The Method of Maximum Likelihood and the Method of Moments


Suppose $y_1, y_2, \ldots, y_n$ is a random sample from a continuous pdf $f_Y(y)$ whose unknown parameter is $\theta$. [Note: To emphasize that our focus is on the parameter, we will identify continuous pdfs in this chapter as $f_Y(y; \theta)$; similarly, discrete probability models with an unknown parameter $\theta$ will be denoted $p_X(k; \theta)$.] The question is, how should we use the data to approximate $\theta$?

In Example 5.1.1, we saw that the parameter $p$ in the discrete probability model $p_X(k; p) = p^k(1 - p)^{1-k}$, $k = 0, 1$ could reasonably be estimated by the function $\left(\frac{1}{n}\right)\sum_{i=1}^{n} k_i$, based on the random sample $X_1 = k_1, X_2 = k_2, \ldots, X_n = k_n$. How would the form of the estimate change if the data came from, say, an exponential distribution? Or a Poisson distribution?

In this section we introduce two techniques for finding estimates—the method 
of maximum likelihood and the method of moments. Others are available, but these 
are the two that are the most widely used. Often, but not always, they give the same 
answer. 


The Method of Maximum Likelihood 


The basic idea behind maximum likelihood estimation is the rationale that was appealed to in Example 5.1.1. That is, it seems plausible to choose as the estimate for $\theta$ the value of the parameter that maximizes the "likelihood" of the sample. The latter is measured by a likelihood function, which is simply the product of the underlying pdf evaluated for each of the data points. In Example 5.1.1, the likelihood function for the sample HHT (i.e., for $X_1 = 1$, $X_2 = 1$, and $X_3 = 0$) is the product $p^2(1 - p)$.


Definition 5.2.1. Let $k_1, k_2, \ldots, k_n$ be a random sample of size $n$ from the discrete pdf $p_X(k; \theta)$, where $\theta$ is an unknown parameter. The likelihood function, $L(\theta)$, is the product of the pdf evaluated at the $n$ $k_i$'s. That is,

\[
L(\theta) = \prod_{i=1}^{n} p_X(k_i; \theta)
\]

If $y_1, y_2, \ldots, y_n$ is a random sample of size $n$ from a continuous pdf, $f_Y(y; \theta)$, where $\theta$ is an unknown parameter, the likelihood function is written

\[
L(\theta) = \prod_{i=1}^{n} f_Y(y_i; \theta)
\]


Comment Joint pdfs and likelihood functions look the same, but the two are interpreted differently. A joint pdf defined for a set of $n$ random variables is a multivariate function of the values of those $n$ random variables, either $k_1, k_2, \ldots, k_n$ or $y_1, y_2, \ldots, y_n$. By contrast, $L$ is a function of $\theta$; it should not be considered a function of either the $k_i$'s or the $y_i$'s.


Definition 5.2.2. Let $L(\theta) = \prod_{i=1}^{n} p_X(k_i; \theta)$ and $L(\theta) = \prod_{i=1}^{n} f_Y(y_i; \theta)$ be the likelihood functions corresponding to random samples $k_1, k_2, \ldots, k_n$ and $y_1, y_2, \ldots, y_n$ drawn from the discrete pdf $p_X(k; \theta)$ and continuous pdf $f_Y(y; \theta)$, respectively, where $\theta$ is an unknown parameter. In each case, let $\theta_e$ be a value of the parameter such that $L(\theta_e) \ge L(\theta)$ for all possible values of $\theta$. Then $\theta_e$ is called a maximum likelihood estimate for $\theta$.


Applying the Method of Maximum Likelihood


We will see in Example 5.2.1 and many subsequent examples that finding the $\theta_e$ that maximizes a likelihood function is often an application of the calculus. Specifically, we solve the equation $\frac{d}{d\theta}L(\theta) = 0$ for $\theta$. In some cases, a more tractable equation results by setting the derivative of $\ln L(\theta)$ equal to 0. Since $\ln L(\theta)$ increases with $L(\theta)$, the same $\theta_e$ that maximizes $\ln L(\theta)$ also maximizes $L(\theta)$.


Example 5.2.1

In Case Study 4.2.2, which discussed modeling $\alpha$-particle emissions, the mean of the data $\bar{k}$ was used as the parameter $\lambda$ of the Poisson distribution. This choice seems reasonable, since $\lambda$ is the mean of the pdf.

In this example, the choice of the sample mean as an estimate of the parameter $\lambda$ of the Poisson distribution will be justified via the method of maximum likelihood, using a small data set to introduce the technique. So, suppose that $X_1 = 3$, $X_2 = 5$, $X_3 = 4$, and $X_4 = 2$ is a set of four independent observations representing the Poisson probability model,

\[
p_X(k; \lambda) = e^{-\lambda}\frac{\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots
\]

Find the maximum likelihood estimate for $\lambda$.

According to Definition 5.2.1,

\[
L(\lambda) = e^{-\lambda}\frac{\lambda^3}{3!} \cdot e^{-\lambda}\frac{\lambda^5}{5!} \cdot e^{-\lambda}\frac{\lambda^4}{4!} \cdot e^{-\lambda}\frac{\lambda^2}{2!} = e^{-4\lambda}\lambda^{14}\,\frac{1}{3!\,5!\,4!\,2!}
\]

Then $\ln L(\lambda) = -4\lambda + 14 \ln \lambda - \ln(3!\,5!\,4!\,2!)$. Differentiating $\ln L(\lambda)$ with respect to $\lambda$ gives

\[
\frac{d \ln L(\lambda)}{d\lambda} = -4 + \frac{14}{\lambda}
\]

To find the $\lambda$ that maximizes $L(\lambda)$, we set the derivative equal to zero. Here $-4 + \frac{14}{\lambda} = 0$ implies that $4\lambda = 14$, and the solution to this equation is $\lambda = \frac{14}{4} = 3.5$.

Notice that the second derivative of $\ln L(\lambda)$ is $-\frac{14}{\lambda^2}$, which is negative for all $\lambda$. Thus, $\lambda = 3.5$ is indeed a true maximum of the likelihood function, as well as the only one. (Following the notation introduced in Definition 5.2.2, the number 3.5 is called the maximum likelihood estimate for $\lambda$, and we would write $\lambda_e = 3.5$.)


Comment There is a better way to answer the question posed in Example 5.2.1. Rather than evaluate (and differentiate) the likelihood function for a particular sample observed (in this case, the four observations 3, 5, 4, and 2), we can get a more informative answer by considering the more general problem of taking a random sample of size $n$ from $p_X(k; \lambda) = e^{-\lambda}\frac{\lambda^k}{k!}$ and using the outcomes ($X_1 = k_1, X_2 = k_2, \ldots, X_n = k_n$) to find a formula for the maximum likelihood estimate.

For the Poisson pdf, the likelihood function based on such a sample would be written

\[
L(\lambda) = \prod_{i=1}^{n} e^{-\lambda}\frac{\lambda^{k_i}}{k_i!} = e^{-n\lambda}\lambda^{\sum_{i=1}^{n} k_i}\,\frac{1}{\prod_{i=1}^{n} k_i!}
\]

As was the case in Example 5.2.1, $\ln L(\lambda)$ is easier to work with than $L(\lambda)$. Here,

\[
\ln L(\lambda) = -n\lambda + \left( \sum_{i=1}^{n} k_i \right) \ln \lambda - \ln \left( \prod_{i=1}^{n} k_i! \right)
\]

and

\[
\frac{d \ln L(\lambda)}{d\lambda} = -n + \frac{\sum_{i=1}^{n} k_i}{\lambda}
\]

Setting the derivative equal to 0 gives

\[
-n + \frac{\sum_{i=1}^{n} k_i}{\lambda} = 0
\]

which implies that $\lambda_e = \frac{\sum_{i=1}^{n} k_i}{n} = \bar{k}$.

Reassuringly, for the particular example used in Example 5.2.1 ($n = 4$ and $\sum_{i=1}^{4} k_i = 14$), the formula just derived reduces to the maximum likelihood estimate of $14/4 = 3.5$ that we found at the outset.

The general result of $\bar{k}$ also justifies the choice of parameter estimate made in Case Study 4.2.2.
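
The agreement between the formula $\lambda_e = \bar{k}$ and a direct maximization can be illustrated numerically. The sketch below evaluates $\ln L(\lambda)$ for the four observations of Example 5.2.1 over a grid of candidate values; the grid (0.01 to 10.00 in steps of 0.01) and the function name `log_likelihood` are arbitrary choices for the illustration.

```python
import math

def log_likelihood(lam, data):
    """ln L(lambda) for a Poisson sample: sum of ln[e^(-lam) lam^k / k!]."""
    return sum(-lam + k * math.log(lam) - math.log(math.factorial(k))
               for k in data)

data = [3, 5, 4, 2]                            # sample from Example 5.2.1
lam_formula = sum(data) / len(data)            # the sample mean, k-bar

grid = [i / 100 for i in range(1, 1001)]       # candidate values 0.01, ..., 10.00
lam_grid = max(grid, key=lambda lam: log_likelihood(lam, data))

print(lam_formula, lam_grid)
```

Both approaches identify 3.5 as the maximizing value of $\lambda$.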


Comment Implicit in Example 5.2.1 and the remarks following it is the important distinction between a maximum likelihood estimate and a maximum likelihood estimator. The first is a number or an expression representing a number; the second is a random variable.

Both the number 3.5 and the formula $\frac{1}{n}\sum_{i=1}^{n} k_i$ are maximum likelihood estimates for $\lambda$ and would be denoted $\lambda_e$ because both are considered numerical constants.

If, on the other hand, we imagine the measurements before they are recorded (that is, as the random variables $X_1, X_2, \ldots, X_n$), then the estimate formula $\frac{1}{n}\sum_{i=1}^{n} k_i$ is more properly written as the random variable $\frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$.

This last expression is the maximum likelihood estimator for $\lambda$ and would be denoted $\hat{\lambda}$. Maximum likelihood estimators such as $\hat{\lambda}$ have pdfs, expected values, and variances, whereas maximum likelihood estimates such as $\lambda_e$ have none of these statistical properties.


Example 5.2.2

Suppose an isolated weather-reporting station has an electronic device whose time to failure is given by the exponential model

\[
f_Y(y; \theta) = \frac{1}{\theta} e^{-y/\theta}, \quad 0 \le y < \infty; \; 0 < \theta < \infty
\]

The station also has a spare device, so the time until this instrument is not available is the sum of two such exponential random variables, whose pdf is

\[
f_Y(y; \theta) = \frac{1}{\theta^2}\, y\, e^{-y/\theta}, \quad 0 \le y < \infty; \; 0 < \theta < \infty
\]

Five data points have been collected: 9.2, 5.6, 18.4, 12.1, and 10.7. Find the maximum likelihood estimate for $\theta$.

Following the advice given in the Comment after Example 5.2.1, we begin by deriving a general formula for $\theta_e$, that is, by assuming that the data are the $n$ observations $y_1, y_2, \ldots, y_n$. The likelihood function then becomes

\[
L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta^2}\, y_i\, e^{-y_i/\theta} = \theta^{-2n} \left( \prod_{i=1}^{n} y_i \right) e^{-(1/\theta)\sum_{i=1}^{n} y_i}
\]

and

\[
\ln L(\theta) = -2n \ln \theta + \ln \left( \prod_{i=1}^{n} y_i \right) - \frac{1}{\theta}\sum_{i=1}^{n} y_i
\]

Setting the derivative of $\ln L(\theta)$ equal to 0 gives

\[
\frac{d \ln L(\theta)}{d\theta} = \frac{-2n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{n} y_i = 0
\]

which implies that

\[
\theta_e = \frac{1}{2n}\sum_{i=1}^{n} y_i
\]

The final step is to evaluate numerically the formula for $\theta_e$. Substituting the actual $n = 5$ sample values recorded gives $\sum_{i=1}^{5} y_i = 9.2 + 5.6 + 18.4 + 12.1 + 10.7 = 56.0$, so

\[
\theta_e = \frac{1}{2(5)}(56.0) = 5.6
\]
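
The arithmetic of Example 5.2.2 can be confirmed with a short Python sketch. It evaluates the closed-form estimate $\theta_e = \frac{1}{2n}\sum y_i$ and, as a sanity check, compares the log-likelihood at that value with its value at two nearby points; the function name `log_likelihood` and the comparison points 5.5 and 5.7 are arbitrary.

```python
import math

def log_likelihood(theta, ys):
    """ln L(theta) for the pdf (1/theta^2) * y * exp(-y/theta)."""
    n = len(ys)
    return (-2 * n * math.log(theta)
            + sum(math.log(y) for y in ys)
            - sum(ys) / theta)

ys = [9.2, 5.6, 18.4, 12.1, 10.7]       # the five recorded data points
theta_e = sum(ys) / (2 * len(ys))       # closed-form maximum likelihood estimate

print(round(theta_e, 4))
print(log_likelihood(theta_e, ys) > log_likelihood(5.5, ys))
print(log_likelihood(theta_e, ys) > log_likelihood(5.7, ys))
```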


Using Order Statistics as Maximum Likelihood Estimates 


Situations exist for which the equations $\frac{dL(\theta)}{d\theta} = 0$ or $\frac{d \ln L(\theta)}{d\theta} = 0$ are not meaningful and neither will yield a solution for $\theta_e$. These occur when the range of the pdf from which the data are drawn is a function of the parameter being estimated. [This happens, for instance, when the sample of $y_i$'s comes from the uniform pdf, $f_Y(y; \theta) = 1/\theta$, $0 \le y \le \theta$.] The maximum likelihood estimates in these cases will be an order statistic, typically either $y_{\min}$ or $y_{\max}$.


288 Chapter 5 Estimation 


Example 5.2.3

Suppose $y_1, y_2, \ldots, y_n$ is a set of measurements representing an exponential pdf with $\lambda = 1$ but with an unknown "threshold" parameter, $\theta$. That is,

\[
f_Y(y; \theta) = e^{-(y - \theta)}, \quad y \ge \theta; \; \theta > 0
\]

(see Figure 5.2.1). Find the maximum likelihood estimate for $\theta$.

[Graph of $f_Y(y; \theta) = e^{-(y - \theta)}$ against $y$.]

Figure 5.2.1

Proceeding in the usual fashion, we start by deriving an expression for the likelihood function:

\[
L(\theta) = \prod_{i=1}^{n} e^{-(y_i - \theta)} = e^{-\sum_{i=1}^{n} y_i + n\theta}
\]

Here, finding $\theta_e$ by solving the equation $\frac{d \ln L(\theta)}{d\theta} = 0$ will not work because $\frac{d \ln L(\theta)}{d\theta} = \frac{d}{d\theta}\left( -\sum_{i=1}^{n} y_i + n\theta \right) = n$. Instead, we need to look at the likelihood function directly.

Notice that $L(\theta) = e^{-\sum_{i=1}^{n} y_i + n\theta}$ is maximized when the exponent of $e$ is maximized. But for given $y_1, y_2, \ldots, y_n$ (and $n$), making $-\sum_{i=1}^{n} y_i + n\theta$ as large as possible requires that $\theta$ be as large as possible. Figure 5.2.1 shows how large $\theta$ can be: It can be moved to the right only as far as the smallest order statistic. Any value of $\theta$ larger than $y_{\min}$ would violate the condition on $f_Y(y; \theta)$ that $y \ge \theta$. Therefore, $\theta_e = y_{\min}$.
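
The boundary behavior in Example 5.2.3 is easy to see numerically. In the sketch below the four data values are hypothetical, chosen only for illustration, and the function `likelihood` returns 0 whenever $\theta$ exceeds the smallest observation, since the pdf is zero for $y < \theta$.

```python
import math

def likelihood(theta, ys):
    """L(theta) for the pdf exp(-(y - theta)), y >= theta."""
    if theta > min(ys):
        return 0.0                     # some y_i < theta, so the pdf is 0 there
    return math.exp(-sum(ys) + len(ys) * theta)

ys = [2.7, 1.9, 3.4, 2.2]              # hypothetical sample, not from the text
theta_e = min(ys)                      # the smallest order statistic

for theta in (1.0, 1.5, theta_e, 2.0):
    print(theta, likelihood(theta, ys))
```

The likelihood rises steadily as $\theta$ increases toward $y_{\min}$ and drops to 0 immediately beyond it.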


Case Study 5.2.1 


Each evening, the media report various averages and indices that are presented 
as portraying the state of the stock market. But do they? Are these numbers 
conveying any really useful information? Some financial analysts would say 
“no,” arguing that speculative markets tend to rise and fall randomly, as though 
some hidden roulette wheel were spinning out the figures. 

One way to test this theory is to model the up-and-down behavior of the markets as a geometric random variable. If this model were to fit, we would be able to argue that the market doesn't use yesterday's history to "decide" whether to rise or fall the next day, nor does this history change the probability $p$ of a rise or $1 - p$ of a fall the following day.

So, suppose that on a given Day 0 the market rose and the following Day 1 it fell. Let the geometric random variable $X$ represent the number of days the market falls (failures) before it rises again (a success). For example, if on Day 2 the market rises, then $X = 1$. In that case $p_X(1) = p$. If the market declines on Days 2, 3, and 4, and then rises on Day 5, $X = 4$ and $p_X(4) = (1 - p)^3 p$.

This model can be examined by comparing the theoretical distribution for $p_X(k)$ to what is observed in a speculative market. However, to do so, the parameter $p$ must be estimated. The maximum likelihood estimate will prove a good choice. Suppose a random sample from the geometric distribution, $k_1, k_2, \ldots, k_n$, is given. Then

\[
L(p) = \prod_{i=1}^{n} p_X(k_i) = \prod_{i=1}^{n} (1 - p)^{k_i - 1} p = (1 - p)^{\sum_{i=1}^{n} k_i - n}\, p^n
\]

and

\[
\ln L(p) = \ln \left[ (1 - p)^{\sum_{i=1}^{n} k_i - n}\, p^n \right] = \left( \sum_{i=1}^{n} k_i - n \right) \ln(1 - p) + n \ln p
\]

Setting the derivative of $\ln L(p)$ equal to 0 gives the equation

\[
-\frac{\sum_{i=1}^{n} k_i - n}{1 - p} + \frac{n}{p} = 0
\]

or, equivalently,

\[
\left( n - \sum_{i=1}^{n} k_i \right) p + n(1 - p) = 0
\]

Solving this equation gives $p_e = n \big/ \sum_{i=1}^{n} k_i = 1/\bar{k}$.


Now, turning to a data set to compare to the geometric model, we employ 
the widely used closing Dow Jones average for the years 2006 and 2007. The first 
column gives the value of k, the argument of the random variable X. Column 2 
presents the number of times X =k in the data set. 


Table 5.2.1 


k Observed Frequency Expected Frequency 


1 72 74.14 
2 35 31.20 
3 11 13.13 
4 6 5.52 
5 2 2.32 
6 2 1.69 


Source: finance.yahoo.com/q/hp?s=%5EDJI.
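
The Expected Frequency column of Table 5.2.1 can be reproduced from the Observed Frequency column with a short Python sketch. It uses the estimate $p_e = n / \sum k_i$ derived above, takes the expected count for $X = k$ to be $n(1 - p_e)^{k-1} p_e$ for $k = 1, \ldots, 5$, and assigns to $k = 6$ whatever remainder makes the expected counts sum to $n$. The dictionary names are hypothetical.

```python
# Observed frequencies from Table 5.2.1: value of k -> number of times X = k
observed = {1: 72, 2: 35, 3: 11, 4: 6, 5: 2, 6: 2}

n = sum(observed.values())                           # sample size
total_k = sum(k * freq for k, freq in observed.items())
p_e = n / total_k                                    # maximum likelihood estimate of p

# Expected count for k = 1..5 under the geometric model
expected = {k: n * (1 - p_e) ** (k - 1) * p_e for k in range(1, 6)}
# Last cell: the remainder, so that the expected counts sum to n
expected[6] = n - sum(expected.values())

print(n, total_k, round(p_e, 4))
for k in sorted(expected):
    print(k, observed[k], round(expected[k], 2))
```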


Note that the Observed Frequency column totals 128, which is $n$ in the formula above for $p_e$. From the table, we obtain

\[
\sum_{i=1}^{n} k_i = 1(72) + 2(35) + 3(11) + 4(6) + 5(2) + 6(2) = 221
\]

Then $p_e = 128/221 = 0.5792$. Using this value, the estimated probability that $X = 2$, for example, is $p_X(2) = (1 - 0.5792)(0.5792) = 0.2437$. If the model gives the probability of $k = 2$ to be 0.2437, then it seems reasonable to expect to see $n(0.2437) = 128(0.2437) = 31.20$ occurrences of $X = 2$. This is the second entry in the Expected Frequency column of the table. The other expected values are calculated similarly, except for the value corresponding to $k = 6$. In that case, we fill in whatever value makes the expected frequencies sum to $n = 128$.

The close agreement between the Observed and Expected Frequency columns argues for the validity of the geometric model, using the maximum likelihood estimate. This suggests that the stock market doesn't remember yesterday.


Finding Maximum Likelihood Estimates When More Than One Parameter Is Unknown


If a family of probability models is indexed by two or more unknown parameters (say, $\theta_1, \theta_2, \ldots, \theta_k$), finding maximum likelihood estimates for the $\theta_i$'s requires the solution of a set of $k$ simultaneous equations. If $k = 2$, for example, we would typically need to solve the system

\[
\frac{\partial \ln L(\theta_1, \theta_2)}{\partial \theta_1} = 0
\]

\[
\frac{\partial \ln L(\theta_1, \theta_2)}{\partial \theta_2} = 0
\]


Example 5.2.4

Suppose a random sample of size $n$ is drawn from the two-parameter normal pdf

\[
f_Y(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\frac{(y - \mu)^2}{\sigma^2}}, \quad -\infty < y < \infty; \; -\infty < \mu < \infty; \; \sigma^2 > 0
\]

Use the method of maximum likelihood to find formulas for $\mu_e$ and $\sigma^2_e$.

We start by finding $L(\mu, \sigma^2)$ and $\ln L(\mu, \sigma^2)$:

\[
L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\frac{(y_i - \mu)^2}{\sigma^2}} = (2\pi\sigma^2)^{-n/2}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \mu)^2}
\]

and

\[
\ln L(\mu, \sigma^2) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \mu)^2
\]
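Moreover,

\[
\frac{\partial \ln L(\mu, \sigma^2)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n} (y_i - \mu)
\]

and

\[
\frac{\partial \ln L(\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2} \cdot \frac{1}{\sigma^2} + \frac{1}{2}\left( \frac{1}{\sigma^2} \right)^2 \sum_{i=1}^{n} (y_i - \mu)^2
\]

Setting the two derivatives equal to zero gives the equations

\[
\sum_{i=1}^{n} (y_i - \mu) = 0 \qquad (5.2.1)
\]

and

\[
-n\sigma^2 + \sum_{i=1}^{n} (y_i - \mu)^2 = 0 \qquad (5.2.2)
\]

Equation 5.2.1 simplifies to

\[
\sum_{i=1}^{n} y_i = n\mu
\]

which implies that $\mu_e = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}$. Substituting $\mu_e$, then, into Equation 5.2.2 gives

\[
-n\sigma^2 + \sum_{i=1}^{n} (y_i - \bar{y})^2 = 0
\]

or

\[
\sigma^2_e = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2
\]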

Comment The method of maximum likelihood has a long history: Daniel Bernoulli 
was using it as early as 1777 (130). It was Ronald Fisher, though, in the early years 
of the twentieth century, who first studied the mathematical properties of likelihood 
estimation in any detail, and the procedure is often credited to him.


Questions 


5.2.1. A random sample of size 8 ($X_1 = 1$, $X_2 = 0$, $X_3 = 1$, $X_4 = 1$, $X_5 = 0$, $X_6 = 1$, $X_7 = 1$, and $X_8 = 0$) is taken from the probability function

\[
p_X(k; \theta) = \theta^k (1 - \theta)^{1-k}, \quad k = 0, 1; \; 0 < \theta < 1
\]

Find the maximum likelihood estimate for $\theta$.


5.2.2. The number of red chips and white chips in an urn
is unknown, but the proportion, p, of reds is either + or + 

A sample of size 5, drawn with replacement, yields the 
sequence red, white, white, red, and white. What is the 
maximum likelihood estimate for p? 


5.2.3. Use the sample $Y_1 = 8.2$, $Y_2 = 9.1$, $Y_3 = 10.6$, and $Y_4 = 4.9$ to calculate the maximum likelihood estimate for $\lambda$ in the exponential pdf

\[
f_Y(y; \lambda) = \lambda e^{-\lambda y}, \quad y \ge 0
\]


5.2.4. Suppose a random sample of size $n$ is drawn from the probability model

\[
p_X(k; \theta) = \frac{\theta^{2k} e^{-\theta^2}}{k!}, \quad k = 0, 1, 2, \ldots
\]

Find a formula for the maximum likelihood estimator, $\hat{\theta}$.
5.2.5. Given that $Y_1 = 2.3$, $Y_2 = 1.9$, and $Y_3 = 4.6$ is a random sample from

\[
f_Y(y; \theta) = \frac{y^3 e^{-y/\theta}}{6\theta^4}, \quad y \ge 0
\]

calculate the maximum likelihood estimate for $\theta$.


5.2.6. Use the method of maximum likelihood to estimate $\theta$ in the pdf

\[
f_Y(y; \theta) = \frac{\theta}{2\sqrt{y}}\, e^{-\theta\sqrt{y}}, \quad y \ge 0
\]

Evaluate $\theta_e$ for the following random sample of size 4: $Y_1 = 6.2$, $Y_2 = 7.0$, $Y_3 = 2.5$, and $Y_4 = 4.2$.


5.2.7. An engineer is creating a project scheduling program and recognizes that the tasks making up the project are not always completed on time. However, the completion proportion tends to be fairly high. To reflect this condition, he uses the pdf

\[
f_Y(y; \theta) = \theta y^{\theta - 1}, \quad 0 \le y \le 1, \text{ and } 0 < \theta
\]

where $y$ is the proportion of the task completed. Suppose that in his previous project, the proportions of tasks completed were 0.77, 0.82, 0.92, 0.94, and 0.98. Estimate $\theta$.


5.2.8. The following data show the number of occu- 
pants in passenger cars observed during one hour at a 
busy intersection in Los Angeles (69). Suppose it can 
be assumed that these data follow a geometric distribution, $p_X(k; p) = (1 - p)^{k-1} p$, $k = 1, 2, \ldots$. Estimate $p$ and compare the observed and expected frequencies for each value of $X$.


Number of Occupants Frequency 

1 678 
2 227 
3 56 
4 28 
5 8 
6+ 14 

1011 


5.2.9. For the Major League Baseball seasons from 1950 
through 2008, there were fifty-nine nine-inning games in 
which one of the teams did not manage to get a hit. The 
data in the table give the number of no-hitters per season 
over this period. Assume that the data follow a Poisson distribution,

\[
p_X(k; \lambda) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots
\]


(a) Estimate 4 and compare the observed and expected 
frequencies. 

(b) Does the agreement (or lack of agreement) in part 
(a) come as a surprise? Explain. 


No. of No-Hitters Frequency 
0 6 
1 19 
2 12 
3 13 
4+ 9 


Source: en.wikipedia.org/wiki/List_of_Major_League_Baseball_no-hitters.


5.2.10. (a) Based on the random sample $Y_1 = 6.3$, $Y_2 = 1.8$, $Y_3 = 14.2$, and $Y_4 = 7.6$, use the method of maximum likelihood to estimate the parameter $\theta$ in the uniform pdf

\[
f_Y(y; \theta) = \frac{1}{\theta}, \quad 0 \le y \le \theta
\]

(b) Suppose the random sample in part (a) represents the two-parameter uniform pdf

\[
f_Y(y; \theta_1, \theta_2) = \frac{1}{\theta_2 - \theta_1}, \quad \theta_1 \le y \le \theta_2
\]

Find the maximum likelihood estimates for $\theta_1$ and $\theta_2$.


5.2.11. Find the maximum likelihood estimate for $\theta$ in the pdf

\[
f_Y(y; \theta) = \frac{2y}{1 - \theta^2}, \quad \theta \le y \le 1
\]

if a random sample of size 6 yielded the measurements 0.70, 0.63, 0.92, 0.86, 0.43, and 0.21.


fry = 
5.2.12. A random sample of size n is taken from the pdf 


Find an expression for 6, the maximum likelihood estima- 
tor for 6. 


5.2.13. If the random variable $Y$ denotes an individual's income, Pareto's law claims that $P(Y \ge y) = \left( \frac{k}{y} \right)^{\theta}$, where $k$ is the entire population's minimum income. It follows that $F_Y(y) = 1 - \left( \frac{k}{y} \right)^{\theta}$, and, by differentiation,

\[
f_Y(y; \theta) = \theta k^{\theta} \left( \frac{1}{y} \right)^{\theta + 1}, \quad y \ge k; \; \theta \ge 1
\]

Assume $k$ is known. Find the maximum likelihood estimator for $\theta$ if income information has been collected on a random sample of 25 individuals.


5.2.14. The exponential pdf is a measure of lifetimes of devices that do not age (see Question 3.11.11). However, the exponential pdf is a special case of the Weibull distribution, which measures time to failure of devices where the probability of failure increases as time does. A Weibull random variable $Y$ has pdf $f_Y(y; \alpha, \beta) = \alpha\beta y^{\beta - 1} e^{-\alpha y^{\beta}}$, $0 \le y$, $0 < \alpha$, $0 < \beta$.

(a) Find the maximum likelihood estimator for $\alpha$ assuming that $\beta$ is known.

(b) Suppose $\alpha$ and $\beta$ are both unknown. Write down the equations that would be solved simultaneously to find the maximum likelihood estimators of $\alpha$ and $\beta$.

5.2.15. Suppose a random sample of size n is drawn from
a normal pdf where the mean $\mu$ is known but the variance
$\sigma^2$ is unknown. Use the method of maximum likelihood
to find a formula for $\hat{\sigma}^2$. Compare your answer to the
maximum likelihood estimator found in Example 5.2.4.


The Method of Moments 


A second procedure for estimating parameters is the method of moments. Proposed
near the turn of the twentieth century by the great British statistician, Karl
Pearson, the method of moments is often more tractable than the method of
maximum likelihood in situations where the underlying probability model has multiple
parameters.
Suppose that Y is a continuous random variable and that its pdf is a function of
s unknown parameters, $\theta_1, \theta_2, \ldots, \theta_s$. The first s moments of Y, if they exist, are given
by the integrals

$$E(Y^j) = \int_{-\infty}^{\infty} y^j \cdot f_Y(y; \theta_1, \theta_2, \ldots, \theta_s)\,dy, \quad j = 1, 2, \ldots, s$$

In general, each $E(Y^j)$ will be a different function of the s parameters. That is,

$$E(Y^1) = g_1(\theta_1, \theta_2, \ldots, \theta_s)$$
$$E(Y^2) = g_2(\theta_1, \theta_2, \ldots, \theta_s)$$
$$\vdots$$
$$E(Y^s) = g_s(\theta_1, \theta_2, \ldots, \theta_s)$$

Corresponding to each theoretical moment, $E(Y^j)$, is a sample moment, $\frac{1}{n}\sum_{i=1}^{n} y_i^j$.
Intuitively, the jth sample moment is an approximation to the jth theoretical
moment. Setting the two equal for each j produces a system of s simultaneous
equations, the solutions to which are the desired set of estimates, $\theta_{1e}, \theta_{2e}, \ldots,$ and $\theta_{se}$.


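Definition 5.2.3. Let $y_1, y_2, \ldots, y_n$ be a random sample from the continuous
pdf $f_Y(y; \theta_1, \theta_2, \ldots, \theta_s)$. The method of moments estimates, $\theta_{1e}, \theta_{2e}, \ldots,$ and $\theta_{se}$,
for the model's unknown parameters are the solutions of the s simultaneous
equations

$$\int_{-\infty}^{\infty} y\, f_Y(y; \theta_1, \theta_2, \ldots, \theta_s)\,dy = \left(\frac{1}{n}\right)\sum_{i=1}^{n} y_i$$

$$\int_{-\infty}^{\infty} y^2 f_Y(y; \theta_1, \theta_2, \ldots, \theta_s)\,dy = \left(\frac{1}{n}\right)\sum_{i=1}^{n} y_i^2$$

$$\vdots$$

$$\int_{-\infty}^{\infty} y^s f_Y(y; \theta_1, \theta_2, \ldots, \theta_s)\,dy = \left(\frac{1}{n}\right)\sum_{i=1}^{n} y_i^s$$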

If the underlying random variable is discrete with pdf $p_X(k; \theta_1, \theta_2, \ldots, \theta_s)$,
the method of moments estimates are the solutions of the system of equations,

$$\sum_{\text{all } k} k^j\, p_X(k; \theta_1, \theta_2, \ldots, \theta_s) = \left(\frac{1}{n}\right)\sum_{\text{all } k} k^j, \quad j = 1, 2, \ldots, s$$


Example 5.2.5


Suppose that $Y_1 = 0.42$, $Y_2 = 0.10$, $Y_3 = 0.65$, and $Y_4 = 0.23$ is a random sample of
size 4 from the pdf

$$f_Y(y; \theta) = \theta y^{\theta - 1}, \quad 0 \le y \le 1$$

Find the method of moments estimate for $\theta$.

Taking the same approach that we followed in finding maximum likelihood esti- 
mates, we will derive a general expression for the method of moments estimate 
before making any use of the four data points. Notice that only one equation needs 
to be solved because the pdf is indexed by just a single parameter. 


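The first theoretical moment of Y is $\frac{\theta}{\theta + 1}$:

$$E(Y) = \int_0^1 y \cdot \theta y^{\theta - 1}\,dy = \theta \cdot \frac{y^{\theta + 1}}{\theta + 1}\Big|_0^1 = \frac{\theta}{\theta + 1}$$

Setting $E(Y)$ equal to $\frac{1}{n}\sum_{i=1}^{n} y_i\ (= \bar{y})$, the first sample moment, gives

$$\frac{\theta}{\theta + 1} = \bar{y}$$

which implies that the method of moments estimate for $\theta$ is

$$\theta_e = \frac{\bar{y}}{1 - \bar{y}}$$

Here, $\bar{y} = \frac{1}{4}(0.42 + 0.10 + 0.65 + 0.23) = 0.35$, so

$$\theta_e = \frac{0.35}{1 - 0.35} = 0.54$$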
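The arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `mom_theta` is a hypothetical helper introduced here, not something from the text.

```python
def mom_theta(sample):
    """Method of moments estimate for theta in the pdf
    f_Y(y; theta) = theta * y**(theta - 1), 0 <= y <= 1.

    Solves theta / (theta + 1) = ybar for theta.
    """
    ybar = sum(sample) / len(sample)
    return ybar / (1 - ybar)


# The four data points of the example
theta_e = mom_theta([0.42, 0.10, 0.65, 0.23])
print(round(theta_e, 2))  # 0.54
```

The general formula is derived first and the data are substituted last, which mirrors the order of the derivation above.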


The gamma distribution, $f_Y(y; r, \lambda) = \frac{\lambda^r}{\Gamma(r)} y^{r-1} e^{-\lambda y}$, $y \ge 0$, often provides a good
model for data that are inherently not symmetric, as the rainfall example below
will show. Deriving maximum likelihood estimators for r and $\lambda$, though, is
difficult because $\Gamma(r)$ does not have a closed-form derivative. However, the method of
moments estimators are not hard to find.

From Theorem 4.6.3, $E(Y) = \frac{r}{\lambda}$ and $\text{Var}(Y) = \frac{r}{\lambda^2}$. Recall that

$$E(Y^2) = \text{Var}(Y) + [E(Y)]^2,$$

so for the gamma distribution,

$$E(Y^2) = \frac{r}{\lambda^2} + \left(\frac{r}{\lambda}\right)^2 = \frac{r(r + 1)}{\lambda^2}$$
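To find the method of moments estimators, form the two equations

$$\frac{r}{\lambda} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y} \quad \text{and} \quad \frac{r(r + 1)}{\lambda^2} = \frac{1}{n}\sum_{i=1}^{n} y_i^2$$

From the first equation, $r = \lambda\bar{y} = \frac{\lambda}{n}\sum_{i=1}^{n} y_i$.

Substituting that value into the second equation gives

$$\frac{\lambda\bar{y}(\lambda\bar{y} + 1)}{\lambda^2} = \bar{y}^2 + \frac{\bar{y}}{\lambda} = \frac{1}{n}\sum_{i=1}^{n} y_i^2$$

The solution of that equation for $\lambda$ gives its method of moments estimate:

$$\lambda_e = \frac{\bar{y}}{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2}$$

and then

$$r_e = \lambda_e \bar{y} = \frac{\bar{y}^2}{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2}$$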


Case Study 5.2.2 


In the western United States, the supply of water to support daily living, agri- 
culture, and industry is a matter of serious concern. For that reason, the U.S. 
Department of Agriculture has established a network of stations to record pre- 
cipitation. One such site is in California just south of Lake Tahoe, with the 
inviting name of Heavenly Valley. Columns 1 and 2 of Table 5.2.2 below give 
the monthly rainfall in inches for the 294 months in which there was some 
measurable precipitation, over a period of twenty-eight years. 


Table 5.2.2 

Inches Rainfall Observed Frequency Expected Frequency 
0-1 87 80.54 
1-2 58 57.19 
2-3 42 41.62 
3-4 23 30.44 
4-5 20 22.31 
5-6 14 16.38 
6-7 10 12.03 
7-8 13 8.84 
8-9 9 6.51 
9-10 4 4.79 
>10 14 13.35 

Source: www.wec.nrcs.usda.gov. 


These data are clearly not symmetric, which suggests that a gamma
distribution might provide a good fit. The original, ungrouped data set provides the
necessary sums for estimating r and $\lambda$:

$$\sum_{i=1}^{294} y_i = 942.0 \quad \text{and} \quad \sum_{i=1}^{294} y_i^2 = 6117.82$$

Then $\lambda_e = \frac{942.0}{6117.82 - \frac{1}{294}(942.0)^2} = 0.3039$ and $r_e = \frac{942.0}{294}(0.3039) = 0.9737$.


Integrating the gamma pdf over the rainfall interval limits and multiplying 
by n(= 294) gives the expected frequencies in the third column. The second 
entry in that column, for example, is given by 


$$294 \int_1^2 \frac{0.3039^{0.9737}}{\Gamma(0.9737)}\, y^{0.9737 - 1} e^{-0.3039y}\,dy = 57.19$$

The above quantity and the others in the third column were calculated using the
Minitab routine

MTB > cdf c1;
SUBC > gamma 0.9737 1/0.3039

Clearly, the agreement between observed and expected frequencies is quite 
good. A visual approach to examining the fit between data and model is pre- 
sented in Figure 5.2.2, where the estimated gamma curve is superimposed on 
the data’s density-scaled histogram. 
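The estimates and expected frequencies in Table 5.2.2 can also be reproduced without Minitab. The sketch below is an illustration, not the authors' procedure: it uses only Python's standard library, and `gamma_cdf` is a hypothetical helper that evaluates the gamma cdf through the standard power series for the regularized lower incomplete gamma function. Only the two sums quoted in the case study are taken from the text.

```python
import math


def gamma_cdf(x, r, lam):
    """P(Y <= x) for the gamma pdf with shape r and rate lam.

    Uses the series P(r, t) = exp(-t) * t**r * sum_{n>=0} t**n / Gamma(r+n+1),
    with t = lam * x.
    """
    if x <= 0:
        return 0.0
    t = lam * x
    term = 1.0 / math.gamma(r + 1)   # n = 0 term
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= t / (r + n)          # Gamma(r+n+1) = (r+n) * Gamma(r+n)
        total += term
    return math.exp(-t) * t**r * total


# Sums reported in the case study
n_obs, sum_y, sum_y2 = 294, 942.0, 6117.82

# Method of moments estimates for the gamma parameters
ybar = sum_y / n_obs
lam_e = ybar / (sum_y2 / n_obs - ybar**2)
r_e = lam_e * ybar

# Expected frequencies for the classes 0-1, 1-2, ..., 9-10, and >10
edges = list(range(0, 11))
expected = [
    n_obs * (gamma_cdf(b, r_e, lam_e) - gamma_cdf(a, r_e, lam_e))
    for a, b in zip(edges, edges[1:])
]
expected.append(n_obs * (1 - gamma_cdf(10, r_e, lam_e)))

print(round(lam_e, 4), round(r_e, 4))
print([round(e, 2) for e in expected])
```

Because the code carries $\lambda_e$ at full precision instead of rounding it to 0.3039 before computing $r_e$, the printed value of $r_e$ can differ from 0.9737 in the fourth decimal place; the expected frequencies should agree with the third column of the table to within rounding.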


Figure 5.2.2 (density-scaled histogram of monthly rainfall, in inches, for the classes 0-1 through 10+, with the estimated gamma curve superimposed; vertical axis: Density)

The adequacy of the approximation here would come as no surprise to
a meteorologist. The gamma distribution is frequently used to describe the
variation in precipitation levels.


Questions 


5.2.16. Let $y_1, y_2, \ldots, y_n$ be a random sample of size n
from the pdf $f_Y(y; \theta) = \frac{2y}{\theta^2}$, $0 \le y \le \theta$. Find a formula for
the method of moments estimate for $\theta$. Compare the values
of the method of moments estimate and the maximum
likelihood estimate if a random sample of size 5 consists of
the numbers 17, 92, 46, 39, and 56 (recall Question 5.2.12).


5.2.17. Use the method of moments to estimate $\theta$ in the
pdf

$$f_Y(y; \theta) = (\theta^2 + \theta)\, y^{\theta - 1}(1 - y), \quad 0 \le y \le 1$$

Assume that a random sample of size n has been collected.


5.2.18. A criminologist is searching through FBI files 
to document the prevalence of a rare double-whorl fin- 
gerprint. Among six consecutive sets of 100,000 prints 
scanned by a computer, the numbers of persons having the 
abnormality are 3, 0, 3, 4, 2, and 1, respectively. Assume 
that double whorls are Poisson events. Use the method of
moments to estimate their occurrence rate, $\lambda$. How would
your answer change if $\lambda$ were estimated using the method
of maximum likelihood?


5.2.19. Find the method of moments estimate for $\lambda$ if a
random sample of size n is taken from the exponential pdf,
$f_Y(y; \lambda) = \lambda e^{-\lambda y}$, $y \ge 0$.


5.2.20. Suppose that $Y_1 = 8.3$, $Y_2 = 4.9$, $Y_3 = 2.6$, and $Y_4 = 6.5$
is a random sample of size 4 from the two-parameter
uniform pdf,

$$f_Y(y; \theta_1, \theta_2) = \frac{1}{2\theta_2}, \quad \theta_1 - \theta_2 \le y \le \theta_1 + \theta_2$$

Use the method of moments to calculate $\theta_{1e}$ and $\theta_{2e}$.


5.2.21. Find a formula for the method of moments
estimate for the parameter $\theta$ in the Pareto pdf,

$$f_Y(y; \theta) = \theta k^{\theta} \left(\frac{1}{y}\right)^{\theta + 1}, \quad y \ge k; \ \theta \ge 1$$

Assume that k is known and that the data consist of a
random sample of size n. Compare your answer to the
maximum likelihood estimator found in Question 5.2.13.


5.2.22. Calculate the method of moments estimate for the
parameter $\theta$ in the probability function

$$p_X(k; \theta) = \theta^k (1 - \theta)^{1 - k}, \quad k = 0, 1$$

if a sample of size 5 is the set of numbers 0, 0, 1, 0, 1.

5.2.23. Find the method of moments estimates for $\mu$ and
$\sigma^2$, based on a random sample of size n drawn from a
normal pdf, where $\mu = E(Y)$ and $\sigma^2 = \text{Var}(Y)$. Compare your
answers with the maximum likelihood estimates derived
in Example 5.2.4.


5.2.24. Use the method of moments to derive formulas
for estimating the parameters r and p in the negative
binomial pdf,

$$p_X(k; r, p) = \binom{k - 1}{r - 1} p^r (1 - p)^{k - r}, \quad k = r, r + 1, \ldots$$


5.2.25. Bird songs can be characterized by the number 
of clusters of “syllables” that are strung together in rapid 
succession. If the last cluster is defined as a “success,” 
it may be reasonable to treat the number of clusters in 
a song as a geometric random variable. Does the model
$p_X(k) = (1 - p)^{k-1} p$, $k = 1, 2, \ldots$, adequately describe the
following distribution of 250 song lengths (100)? Begin 
by finding the method of moments estimate for p. Then 
calculate the set of “expected” frequencies. 


No. of Clusters/Song Frequency 


132 
52 


DANN WNPR 
\o 


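5.2.26. Let $y_1, y_2, \ldots, y_n$ be a random sample from the
continuous pdf $f_Y(y; \theta_1, \theta_2)$. Let $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$. Show
that the solutions of the equations

$$E(Y) = \bar{y} \quad \text{and} \quad \text{Var}(Y) = \hat{\sigma}^2$$

for $\theta_1$ and $\theta_2$ give the same results as using the equations
in Definition 5.2.3.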


5.3 Interval Estimation 


Point estimates, no matter how they are determined, share the same fundamental
weakness: They provide no indication of their inherent precision. We know, for
instance, that $\hat{\lambda} = \bar{X}$ is both the maximum likelihood and the method of moments
estimator for the Poisson parameter, $\lambda$. But suppose a sample of size 6 is taken from
the probability model $p_X(k) = e^{-\lambda}\lambda^k/k!$ and we find that $\lambda_e = 6.8$. Does it follow
that the true $\lambda$ is likely to be close to $\lambda_e$ (say, in the interval from 6.7 to 6.9), or
is the estimation process so imprecise that $\lambda$ might actually be as small as 1.0, or
as large as 12.0? Unfortunately, point estimates, by themselves, do not allow us to
make those kinds of extrapolations. Any such statements require that the variation
of the estimator be taken into account.

The usual way to quantify the amount of uncertainty in an estimator is to
construct a confidence interval. In principle, confidence intervals are ranges of numbers
that have a high probability of "containing" the unknown parameter as an interior
point. By looking at the width of a confidence interval, we can get a good sense of
the estimator's precision.


Example 5.3.1

Suppose that 6.5, 9.2, 9.9, and 12.4 constitute a random sample of size 4 from the pdf

$$f_Y(y; \mu) = \frac{1}{\sqrt{2\pi}(0.8)}\, e^{-\frac{1}{2}\left(\frac{y - \mu}{0.8}\right)^2}, \quad -\infty < y < \infty$$

That is, the four $y_i$'s come from a normal distribution where $\sigma$ is equal to 0.8, but
the mean, $\mu$, is unknown. What values of $\mu$ are believable in light of the four data
points?

To answer that question requires that we keep the distinction between estimates
and estimators clearly in mind. First of all, we know from Example 5.2.4 that the
maximum likelihood estimate for $\mu$ is $\mu_e = \bar{y} = \left(\frac{1}{n}\right)\sum_{i=1}^{n} y_i = \left(\frac{1}{4}\right)(38.0) = 9.5$. We also
know something very specific about the probabilistic behavior of the maximum
likelihood estimator, $\bar{Y}$: According to the corollary to Theorem 4.3.3, $\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} = \frac{\bar{Y} - \mu}{0.8/\sqrt{4}}$ has
a standard normal pdf, $f_Z(z)$. The probability, then, that $\frac{\bar{Y} - \mu}{0.8/\sqrt{4}}$ will fall between two
specified values can be deduced from Table A.1 in the Appendix. For example,

$$P(-1.96 \le Z \le 1.96) = 0.95 = P\left(-1.96 \le \frac{\bar{Y} - \mu}{0.8/\sqrt{4}} \le 1.96\right) \quad (5.3.1)$$

(see Figure 5.3.1).
Figure 5.3.1


"Inverting" probability statements of the sort illustrated in Equation 5.3.1 is the
mechanism by which we can identify a set of parameter values compatible with the
sample data. If

$$P\left(-1.96 \le \frac{\bar{Y} - \mu}{0.8/\sqrt{4}} \le 1.96\right) = 0.95$$

then

$$P\left(\bar{Y} - 1.96\,\frac{0.8}{\sqrt{4}} \le \mu \le \bar{Y} + 1.96\,\frac{0.8}{\sqrt{4}}\right) = 0.95$$

which implies that the random interval

$$\left(\bar{Y} - 1.96\,\frac{0.8}{\sqrt{4}},\ \bar{Y} + 1.96\,\frac{0.8}{\sqrt{4}}\right)$$

has a 95% chance of containing $\mu$ as an interior point.
After substituting for $\bar{Y}$, the random interval in this case reduces to

$$\left(9.50 - 1.96\,\frac{0.8}{\sqrt{4}},\ 9.50 + 1.96\,\frac{0.8}{\sqrt{4}}\right) = (8.72, 10.28)$$

We call (8.72, 10.28) a 95% confidence interval for $\mu$. In the long run, 95% of
the intervals constructed in this fashion will contain the unknown $\mu$; the
remaining 5% will lie either entirely to the left of $\mu$ or entirely to the right. For a
given set of data, of course, we have no way of knowing whether the calculated
$\left(\bar{y} - 1.96 \cdot \frac{0.8}{\sqrt{4}},\ \bar{y} + 1.96 \cdot \frac{0.8}{\sqrt{4}}\right)$ is one of the 95% that contains $\mu$ or one of the 5% that
does not.

Figure 5.3.2 illustrates graphically the statistical implications associated with the
random interval $\left(\bar{Y} - 1.96\,\frac{0.8}{\sqrt{4}},\ \bar{Y} + 1.96\,\frac{0.8}{\sqrt{4}}\right)$. For every different $\bar{y}$, the interval will
have a different location. While there is no way to know whether or not a given
interval (in particular, the one the experimenter has just calculated) will include
the unknown $\mu$, we do have the reassurance that in the long run, 95% of all such
intervals will.


Figure 5.3.2 (possible 95% confidence intervals for $\mu$, one for each of data sets 1 through 8)


Comment The behavior of confidence intervals can be modeled nicely by using a
computer's random number generator. The output in Table 5.3.1 is a case in point.
Fifty simulations of the confidence interval described in Example 5.3.1 are displayed.
That is, fifty samples, each of size n = 4, were drawn from the normal pdf

$$f_Y(y; \mu) = \frac{1}{\sqrt{2\pi}(0.8)}\, e^{-\frac{1}{2}\left(\frac{y - \mu}{0.8}\right)^2}, \quad -\infty < y < \infty$$

using Minitab's RANDOM command. (To fully specify the model, and to know
the value that each confidence interval was seeking to contain, the true $\mu$ was
assumed to equal ten.) For each sample of size n = 4, the lower and upper limits
of the corresponding 95% confidence interval were calculated, using the formulas
Table 5.3.1

MTB > random 50 c1-c4;
SUBC > normal 10 0.8.
MTB > rmean c1-c4 c5
MTB > let c6 = c5 - 1.96*(0.8)/sqrt(4)
MTB > let c7 = c5 + 1.96*(0.8)/sqrt(4)
MTB > name c6 'Low.Lim.' c7 'Upp.Lim.'
MTB > print c6 c7

Data Display

Row   Low.Lim.   Upp.Lim.   Contains μ = 10?
  1    8.7596    10.3276    Yes
  2    8.8763    10.4443    Yes
  3    8.8337    10.4017    Yes
  4    9.5800    11.1480    Yes
  5    8.5106    10.0786    Yes
  6    9.6946    11.2626    Yes
  7    8.7079    10.2759    Yes
  8   10.0014    11.5694    NO
  9    9.3408    10.9088    Yes
 10    9.5428    11.1108    Yes
 11    8.4650    10.0330    Yes
 12    9.6346    11.2026    Yes
 13    9.2076    10.7756    Yes
 14    9.2517    10.8197    Yes
 15    8.7568    10.3248    Yes
 16    9.8439    11.4119    Yes
 17    9.3297    10.8977    Yes
 18    9.5685    11.1365    Yes
 19    8.9728    10.5408    Yes
 20    8.5775    10.1455    Yes
 21    9.3979    10.9659    Yes
 22    9.2115    10.7795    Yes
 23    9.6277    11.1957    Yes
 24    9.4252    10.9932    Yes
 25    9.6868    11.2548    Yes
 26    8.8779    10.4459    Yes
 27    9.1570    10.7250    Yes
 28    9.3277    10.8957    Yes
 29    9.1606    10.7286    Yes
 30    8.8919    10.4599    Yes
 31    9.3838    10.9518    Yes
 32    8.7575    10.3255    Yes
 33   10.4602    12.0282    NO
 34    8.9437    10.5117    Yes
 35    9.0049    10.5729    Yes
 36    9.0148    10.5828    Yes
 37    8.8110    10.3790    Yes
 38    9.1981    10.7661    Yes
 39    9.0042    10.5722    Yes
 40    9.7019    11.2699    Yes
 41    9.2167    10.7847    Yes
 42    8.3901     9.9581    NO
 43    8.6337    10.2017    Yes
 44    9.4606    11.0286    Yes
 45    9.3278    10.8958    Yes
 46    8.5843    10.1523    Yes
 47    9.0541    10.6221    Yes
 48    9.2042    10.7722    Yes
 49    9.2710    10.8390    Yes
 50    9.5697    11.1377    Yes

47 of the 50 95% confidence intervals contain the true μ (= 10)

$$\text{Low.Lim.} = \bar{y} - 1.96\,\frac{0.8}{\sqrt{4}}$$

$$\text{Upp.Lim.} = \bar{y} + 1.96\,\frac{0.8}{\sqrt{4}}$$


As the last column in the DATA DISPLAY indicates, only three of the fifty
confidence intervals fail to contain $\mu = 10$: Samples eight and thirty-three yield intervals
that lie entirely to the right of the parameter, while sample forty-two produces a
range of values that lies entirely to the left. The remaining forty-seven intervals,
though, or 94% $\left(= \frac{47}{50} \times 100\right)$, do contain the true value of $\mu$ as an interior point.
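The same experiment can be repeated in Python as a rough check on the long-run interpretation. The sketch below is an assumption-laden stand-in for the Minitab session, not a reproduction of it: the function name `coverage`, the seed, and the use of Python's `random` module are all choices made here, so the individual intervals will not match Table 5.3.1.

```python
import random
from math import sqrt


def coverage(num_samples, n=4, mu=10.0, sigma=0.8, z=1.96, seed=1):
    """Fraction of simulated 95% intervals (ybar -/+ z*sigma/sqrt(n))
    that contain the true mean mu as an interior point."""
    rng = random.Random(seed)
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(num_samples):
        ybar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if ybar - half_width < mu < ybar + half_width:
            hits += 1
    return hits / num_samples


print(coverage(50))        # fifty intervals, as in Table 5.3.1
print(coverage(100_000))   # many intervals: the fraction should settle near 0.95
```

With only fifty intervals the observed fraction fluctuates from run to run (94% in the table); with a large number of intervals it should be close to the nominal 95%.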


Case Study 5.3.1 


In the eighth century B.C., the Etruscan civilization was the most advanced in 
all of Italy. Its art forms and political innovations were destined to leave indeli- 
ble marks on the entire Western world. Originally located along the western 
coast between the Arno and Tiber Rivers (the region now known as Tuscany), 
it spread quickly across the Apennines and eventually overran much of Italy. 
But as quickly as it came, it faded. Militarily it was to prove no match for the 
burgeoning Roman legions, and by the dawn of Christianity it was all but gone. 

No written history from the Etruscan empire has ever been found, and 
to this day its origins remain shrouded in mystery. Were the Etruscans native 
Italians, or were they immigrants? And if they were immigrants, where did 
they come from? Much of what is known has come from anthropometric 
studies — that is, investigations that use body measurements to determine racial 
characteristics and ethnic origins. 

A case in point is the set of data given in Table 5.3.2, showing the sizes of
eighty-four Etruscan skulls unearthed in various archaeological digs throughout
Italy (6). The sample mean, $\bar{y}$, of those measurements is 143.8 mm. Researchers
believe that skull widths of present-day Italian males are normally distributed
with a mean ($\mu$) of 132.4 mm and a standard deviation ($\sigma$) of 6.0 mm. What does

Table 5.3.2 


Maximum Head Breadths (mm) of 84 Etruscan Males 




the difference between $\bar{y} = 143.8$ and $\mu = 132.4$ imply about the likelihood that
Etruscans and Italians share the same ethnic origin?

One way to answer that question is to construct a 95% confidence
interval for the true mean of the population represented by the eighty-four $y_i$'s in
Table 5.3.2. If that confidence interval fails to contain $\mu = 132.4$, it could be
argued that the Etruscans were not the forebears of modern Italians. (Of course,
it would also be necessary to factor in whatever evolutionary trends in skull
sizes have occurred for Homo sapiens, in general, over the past three thousand
years.)

It follows from the discussion in Example 5.3.1 that the endpoints for a 95%
confidence interval for $\mu$ are given by the general formula

$$\left(\bar{y} - 1.96 \cdot \frac{\sigma}{\sqrt{n}},\ \bar{y} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right)$$

Here, that expression reduces to

$$\left(143.8 - 1.96 \cdot \frac{6.0}{\sqrt{84}},\ 143.8 + 1.96 \cdot \frac{6.0}{\sqrt{84}}\right) = (142.5 \text{ mm}, 145.1 \text{ mm})$$

Since the value $\mu = 132.4$ is not contained in the 95% confidence interval (or
even close to being contained), we would conclude that a sample mean of 143.8
(based on a sample of size 84) is not likely to have come from a normal
population where $\mu = 132.4$ (and $\sigma = 6.0$). It would appear, in other words, that Italians
are not the direct descendants of Etruscans.


Comment Random intervals can be constructed to have whatever "confidence" we
choose. Suppose $z_{\alpha/2}$ is defined to be the value for which $P(Z \ge z_{\alpha/2}) = \alpha/2$. If $\alpha =
0.05$, for example, $z_{\alpha/2} = z_{.025} = 1.96$. A $100(1 - \alpha)\%$ confidence interval for $\mu$, then,
is the range of numbers

$$\left(\bar{y} - z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}},\ \bar{y} + z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\right)$$

In practice, $\alpha$ is typically set at either 0.10, 0.05, or 0.01, although in some fields 50%
confidence intervals are frequently used.
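As a small illustration of the general formula, the sketch below computes a $100(1 - \alpha)\%$ interval for $\mu$ when $\sigma$ is known. The helper name `z_interval` is hypothetical; the value $z_{\alpha/2}$ is obtained from `statistics.NormalDist.inv_cdf` in the Python standard library rather than from Table A.1.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(ybar, sigma, n, alpha=0.05):
    """100(1 - alpha)% confidence interval for a normal mean, sigma known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}, about 1.96 for alpha = 0.05
    half_width = z * sigma / sqrt(n)
    return ybar - half_width, ybar + half_width


# Case Study 5.3.1: 84 Etruscan skulls, ybar = 143.8, sigma = 6.0
lo, hi = z_interval(143.8, 6.0, 84)
print(round(lo, 1), round(hi, 1))   # 142.5 145.1

# Example 5.3.1: n = 4, ybar = 9.5, sigma = 0.8
lo2, hi2 = z_interval(9.5, 0.8, 4)
print(round(lo2, 2), round(hi2, 2))  # 8.72 10.28
```

Changing `alpha` to 0.10 or 0.01 gives the 90% and 99% intervals, which are narrower and wider, respectively, than the 95% interval.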


Confidence Intervals for the Binomial Parameter, p 


Perhaps the most frequently encountered applications of confidence intervals are 
those involving the binomial parameter, p. Opinion surveys are often the context: 
When polls are released, it has become standard practice to issue a disclaimer by 
saying that the findings have a certain margin of error. As we will see later in this 
section, margins of error are related to 95% confidence intervals. 

The inversion technique followed in Example 5.3.1 can be applied to
large-sample binomial random variables as well. We know from Theorem 4.3.1 that

$$\frac{X - np}{\sqrt{np(1 - p)}} = \frac{X/n - p}{\sqrt{p(1 - p)/n}}$$

has approximately a standard normal distribution when X is binomial and n is large.
It is also true that the pdf describing

$$\frac{X/n - p}{\sqrt{\dfrac{(X/n)(1 - X/n)}{n}}}$$

can be approximated by $f_Z(z)$, a result that seems plausible given that $X/n$ is the
maximum likelihood estimator for p.
Therefore,

$$P\left(-z_{\alpha/2} \le \frac{X/n - p}{\sqrt{\dfrac{(X/n)(1 - X/n)}{n}}} \le z_{\alpha/2}\right) = 1 - \alpha \quad (5.3.2)$$

Rewriting Equation 5.3.2 by isolating p in the center of the inequalities leads to the
formula given in Theorem 5.3.1.


Theorem 5.3.1. Let k be the number of successes in n independent trials, where n is large and
p = P(success) is unknown. An approximate $100(1 - \alpha)\%$ confidence interval for p is
the set of numbers

$$\left(\frac{k}{n} - z_{\alpha/2}\sqrt{\frac{(k/n)(1 - k/n)}{n}},\ \frac{k}{n} + z_{\alpha/2}\sqrt{\frac{(k/n)(1 - k/n)}{n}}\right)$$

Case Study 5.3.2 


A majority of Americans have favored increased fuel efficiency for automobiles. 
Some do not, primarily because of concern over increased costs, or from general 
opposition to government mandates. The public’s intensity about the issue tends 
to fluctuate with the price of gasoline. In the summer of 2008, when the national 
average of prices for regular unleaded gasoline exceeded $4 per gallon, fuel 
efficiency became part of the political landscape. 

How much the public does favor increased fuel efficiency has been the sub- 
ject of numerous polls. A Gallup telephone poll of 1012 adults (18 and over) 
in March of 2009 reported that 810 favored the setting of higher fuel-efficiency 
standards for automobiles. 

Given that n = 1012 and k = 810, the "believable" values for p, the
probability that an adult does favor efficiency, according to Theorem 5.3.1, are the
proportions from 0.776 to 0.825:

$$\left(\frac{810}{1012} - 1.96\sqrt{\frac{(810/1012)(1 - 810/1012)}{1012}},\ \frac{810}{1012} + 1.96\sqrt{\frac{(810/1012)(1 - 810/1012)}{1012}}\right) = (0.776, 0.825)$$

If the true proportion of Americans, in other words, who support increased
fuel efficiency is less than 0.776 or greater than 0.825, it would be unlikely
that a sample proportion (based on 1012 responses) would be the observed
810/1012 = 0.800.
Source: http://www.gallup.com/poll/118543/Americans-Green-Light-Higher-Fuel-Efficiency-Standards.aspx. 
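The interval of Theorem 5.3.1 is simple to compute directly. The following sketch is illustrative only (the function name `binomial_interval` is an assumption of this example); it applies the large-sample formula to the Gallup figures quoted above.

```python
from math import sqrt
from statistics import NormalDist


def binomial_interval(k, n, alpha=0.05):
    """Approximate 100(1 - alpha)% confidence interval for a binomial p,
    based on k successes in n trials (large-sample formula)."""
    p_hat = k / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Gallup poll: 810 of 1012 adults favored higher fuel-efficiency standards
lo, hi = binomial_interval(810, 1012)
print(round(lo, 3), round(hi, 3))   # 0.776 0.825
```

The formula relies on the normal approximation, so it should be used only when n is large, as the theorem states.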

Comment We call (0.776, 0.825) a 95% confidence interval for p, but it does not
follow that p has a 95% chance of lying between 0.776 and 0.825. The parameter p
is a constant, so it falls between 0.776 and 0.825 either 0% of the time or 100% of
the time. The "95%" refers to the procedure by which the interval is constructed, not
to any particular interval. This, of course, is entirely analogous to the interpretation
given earlier to 95% confidence intervals for $\mu$.


Comment Robert Frost was certainly more familiar with iambic pentameter than 
he was with estimated parameters, but in 1942 he wrote a couplet that sounds very 
much like a poet’s perception of a confidence interval (98): 


We dance round in a ring and suppose, 
But the Secret sits in the middle and knows. 


Central to every statistical software package is a random number generator. Two or
three simple commands are typically all that are required to output a sample of size
n representing any of the standard probability models. But how can we be certain
that numbers purporting to be random observations from, say, a normal distribution
with $\mu = 50$ and $\sigma = 10$ actually do represent that particular pdf?

The answer is, we cannot; however, a number of "tests" are available to check
whether the simulated measurements appear to be random with respect to a given
criterion. One such procedure is the median test.

Suppose $y_1, y_2, \ldots, y_n$ denote measurements presumed to have come from a
continuous pdf $f_Y(y)$. Let k denote the number of $y_i$'s that are less than the median
of $f_Y(y)$. If the sample is random, we would expect the difference between $\frac{k}{n}$ and $\frac{1}{2}$
to be small. More specifically, a 95% confidence interval based on $\frac{k}{n}$ should contain
the value 0.5.

Listed in Table 5.3.3 is a set of sixty $y_i$'s generated by Minitab to represent the
exponential pdf, $f_Y(y) = e^{-y}$, $y \ge 0$. Does this sample pass the median test?

The median here is m = 0.69315:

$$\int_0^m e^{-y}\,dy = -e^{-y}\Big|_0^m = 1 - e^{-m} = 0.5$$

which implies that $m = -\ln(0.5) = 0.69315$. Notice that of the sixty entries in
Table 5.3.3, a total of k = 26 (those marked with an asterisk, *) fall to the left of
the median. For these particular $y_i$'s, then, $\frac{k}{n} = \frac{26}{60} = 0.433$.

Table 5.3.3

0.00940*   0.75095    2.32466    0.66715*   3.38765    3.01784    0.05509*
0.93661    1.39603    0.50795*   0.11041*   2.89577    1.20041    1.44422
0.46474*   0.48272*   0.48223*   3.59149    1.38016    0.41382*   0.31684*
0.58175*   0.86681    0.55491*   0.07451*   1.88641    2.40564    1.07111
5.05936    0.04804*   0.07498*   1.52084    1.06972    0.62928*   0.09433*
1.83196    1.91987    1.92874    1.93181    0.78811    2.16919    1.16045
0.81223    1.84549    1.20752    0.11387*   0.38966*   0.42250*   0.77279
1.31728    0.81077    0.59111*   0.36793*   0.16938*   2.41135    0.21528*
0.54938*   0.73217    0.52019*   0.73169

* number < 0.69315 [= median of $f_Y(y) = e^{-y}$, $y \ge 0$]

Let p denote the (unknown) probability that a random observation produced
by Minitab's generator will lie to the left of the pdf's median. Based on these sixty
observations, the 95% confidence interval for p is the range of numbers extending
from 0.308 to 0.558:

$$\left(\frac{26}{60} - 1.96\sqrt{\frac{(26/60)(1 - 26/60)}{60}},\ \frac{26}{60} + 1.96\sqrt{\frac{(26/60)(1 - 26/60)}{60}}\right) = (0.308, 0.558)$$

The fact that the value p = 0.50 is contained in the confidence interval implies
that these data do pass the median test. It is entirely believable, in other words, that a
bona fide exponential random sample of size 60 would have twenty-six observations
falling below the pdf's median, and thirty-four above.
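The median test for these sixty values can be carried out in a few lines. The sketch below is an illustration rather than the procedure used in the text: the variable names are choices made here, and the data are the entries of Table 5.3.3 typed in by hand.

```python
from math import log, sqrt

# The sixty simulated exponential observations of Table 5.3.3
sample = [
    0.00940, 0.75095, 2.32466, 0.66715, 3.38765, 3.01784, 0.05509,
    0.93661, 1.39603, 0.50795, 0.11041, 2.89577, 1.20041, 1.44422,
    0.46474, 0.48272, 0.48223, 3.59149, 1.38016, 0.41382, 0.31684,
    0.58175, 0.86681, 0.55491, 0.07451, 1.88641, 2.40564, 1.07111,
    5.05936, 0.04804, 0.07498, 1.52084, 1.06972, 0.62928, 0.09433,
    1.83196, 1.91987, 1.92874, 1.93181, 0.78811, 2.16919, 1.16045,
    0.81223, 1.84549, 1.20752, 0.11387, 0.38966, 0.42250, 0.77279,
    1.31728, 0.81077, 0.59111, 0.36793, 0.16938, 2.41135, 0.21528,
    0.54938, 0.73217, 0.52019, 0.73169,
]

median = -log(0.5)                        # median of f_Y(y) = exp(-y), about 0.69315
n = len(sample)
k = sum(1 for y in sample if y < median)  # observations to the left of the median

# 95% confidence interval for p = P(observation < median), Theorem 5.3.1
p_hat = k / n
half_width = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - half_width, p_hat + half_width
passes = lower < 0.5 < upper

print(k, n, round(lower, 3), round(upper, 3), passes)
```

Carrying $k/n = 26/60$ at full precision gives an upper endpoint of about 0.559 rather than 0.558; the small difference presumably reflects rounding $k/n$ to 0.433 before the calculation, and it does not affect the conclusion.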


Margin of Error 


In the popular press, estimates for p (i.e., values of $\frac{k}{n}$) are typically accompanied by
a margin of error, as opposed to a confidence interval. The two are related: A
margin of error is half the maximum width of a 95% confidence interval. (The number
actually quoted is usually expressed as a percentage.)

Let w denote the width of a 95% confidence interval for p. From Theorem 5.3.1,

$$w = \frac{k}{n} + 1.96\sqrt{\frac{(k/n)(1 - k/n)}{n}} - \left(\frac{k}{n} - 1.96\sqrt{\frac{(k/n)(1 - k/n)}{n}}\right) = 3.92\sqrt{\frac{(k/n)(1 - k/n)}{n}}$$

Notice that for fixed n, w is a function of the product $\left(\frac{k}{n}\right)\left(1 - \frac{k}{n}\right)$. But given that
$0 \le \frac{k}{n} \le 1$, the largest value that $\left(\frac{k}{n}\right)\left(1 - \frac{k}{n}\right)$ can achieve is $\frac{1}{2} \cdot \frac{1}{2}$, or $\frac{1}{4}$ (see
Question 5.3.18). Therefore,

$$\max w = 3.92\sqrt{\frac{1}{4n}}$$


Definition 5.3.1. The margin of error associated with an estimate $\frac{k}{n}$, where k is
the number of successes in n independent trials, is 100d%, where

$$d = \frac{1.96}{2\sqrt{n}}$$

In the mid-term elections of 2006, the political winds were shifting. One of the key 
races for control of the Senate was in Virginia, where challenger Jim Webb and 
incumbent George Allen were in a very tight race. Just a week before the election, 
the Associated Press reported on a CNN poll based on telephone interviews of 597 
registered voters who identified themselves as likely to vote. Webb was the choice of 
299 of those surveyed. The article went on to state, “Because Webb’s edge is equal 
to the margin of error of plus or minus 4 percentage points, it means that he can be 
considered slightly ahead.” 

Is the margin of error in fact 4%? Applying Definition 5.3.1 (with n = 597) shows
that the margin of error associated with the poll's result, using a 95% confidence
interval, is indeed 4%:

$$\frac{1.96}{2\sqrt{597}} = 0.040$$

Notice that the margin of error has nothing to do with the actual survey results.
Had the percentage of respondents preferring Webb been 25%, 75%, or any other
number, the margin of error, by definition, would have been the same.

The more important question is whether these results have any real meaning in
what was clearly to be a close election.

Source: http://archive.newsmax.com/archives/ic/2006/10/31/72811.shtml?s=ic.
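Definition 5.3.1 translates into a one-line function. In the sketch below the name `margin_of_error` is a hypothetical helper; it takes only the sample size as input, which makes the point that the margin of error does not depend on the survey results themselves.

```python
from math import sqrt


def margin_of_error(n, z=1.96):
    """Margin of error d = z / (2 * sqrt(n)) for a proportion based on n trials
    (half the maximum width of the corresponding confidence interval)."""
    return z / (2 * sqrt(n))


d = margin_of_error(597)
print(f"{100 * d:.1f}%")   # 4.0%
```

Because d is proportional to $1/\sqrt{n}$, cutting the margin of error in half requires roughly four times as many respondents.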


About the Data Example 5.3.3 shows how the use of the margin of error has been 
badly handled by the media. The faulty interpretations are particularly prevalent in 
the context of political polls, especially since media reports of polls fail to give the 
confidence level, which is always taken to be 95%. Another issue is whether the con- 
fidence intervals provided are in fact useful. In Example 5.3.3, the 95% confidence 
interval has margin of error 4% and is 


(0.501 — 0.040, 0.501 + 0.040) = (0.461, 0.541) 


However, such a margin of error yields a confidence interval that is too wide to 
provide any meaningful information. The campaign had had media attention for 
months. Even a less-than-astute political observer would have been quite certain 
that the proportion of people voting for Webb would be between 0.461 and 0.541. 
As it turned out, the race was as close as predicted, and Webb won by a margin of 
just over seven thousand votes out of more than two million cast. 

Even when political races are not as close as the Webb—Allen race, persistent 
misinterpretations abound. Here is what happens. A poll (based on a sample of n 
voters) is conducted, showing, for example, that 52% of the respondents intend to 
support Candidate A and 48%, Candidate B. Moreover, the corresponding margin 
of error, based on the sample of size n, is (correctly) reported to be, say, 5%. What 
often comes next is a statement that the race is a “statistical tie” or a “statistical 
dead heat” because the difference between the two percentages, 52% — 48% = 4%, 
is within the 5% margin of error. Is that statement true? No. Is it even close to being 
true? No. 

If the observed difference in the percentages supporting Candidate A and 
Candidate B is 4% and the margin of error is 5%, then the widest possible 95% 
confidence interval for p, the true difference between the two percentages (p = 
Candidate A’s true % — Candidate B’s true %) would be 


(4% — 5%, 4% + 5%) = (—-1%, 9%) 


The latter implies that we should not rule out the possibility that the true value for
p could be as small as −1% (in which case Candidate B would win a tight race) or as
large as +9% (in which case Candidate A would win in a landslide). The serious
mistake in the "statistical tie" terminology is the implication that all the possible values
from −1% to +9% are equally likely. That is simply not true. For every confidence
interval, parameter values near the center are much more plausible than those near
either the left-hand or right-hand endpoints. Here, a 4% lead for Candidate A in a
poll that has a 5% margin of error is not a "tie"; quite the contrary, it would more
properly be interpreted as almost a guarantee that Candidate A will win.

Misinterpretations aside, there is yet a more fundamental problem in using the
margin of error as a measure of the day-to-day or week-to-week variation in political
polls. By definition, the margin of error refers to sampling variation; that is, it
reflects the extent to which the estimator $\hat{p} = \frac{X}{n}$ varies if repeated samples of size
n are drawn from the same population. Consecutive political polls, though, do not
represent the same population. Between one poll and the next, a variety of scenarios
can transpire that can fundamentally change the opinions of the voting population:
one candidate may give an especially good speech or make an embarrassing gaffe, a
scandal can emerge that seriously damages someone's reputation, or a world event
comes to pass that for one reason or another reflects more negatively on one
candidate than the other. Although all of these possibilities have the potential to influence
the value of $\frac{X}{n}$ much more than sampling variability can, none of them is included in
the margin of error.


Choosing Sample Sizes


Related to confidence intervals and margins of error is an important experimental
design question. Suppose a researcher wishes to estimate the binomial parameter p
based on results from a series of n independent trials, but n has yet to be determined.
Larger values of n will, of course, yield estimates having greater precision, but more
observations also demand greater expenditures of time and money. How can those
two concerns best be reconciled?

If the experimenter can articulate the minimal degree of precision that would
be considered acceptable, a Z transformation can be used to calculate the smallest
(i.e., the cheapest) sample size capable of achieving that objective. For example,
suppose we want $\frac{X}{n}$ to have at least a $100(1 - \alpha)\%$ probability of lying within a
distance d of p. The problem is solved, then, if we can find the smallest n for
which

$$P\left(-d \le \frac{X}{n} - p \le d\right) = 1 - \alpha \quad (5.3.3)$$


Theorem 5.3.2. Let $\frac{X}{n}$ be the estimator for the parameter p in a binomial distribution. In order for $\frac{X}{n}$
to have at least a $100(1 - \alpha)\%$ probability of being within a distance d of p, the sample
size should be no smaller than

$$n = \frac{z_{\alpha/2}^2}{4d^2}$$

where $z_{\alpha/2}$ is the value for which $P(Z \ge z_{\alpha/2}) = \alpha/2$.


Proof Start by dividing the terms in the probability portion of Equation 5.3.3 by the
standard deviation of $X/n$ to form an approximate Z ratio:

$$P\left(-d \le \frac{X}{n} - p \le d\right) = P\left(\frac{-d}{\sqrt{p(1-p)/n}} \le \frac{X/n - p}{\sqrt{p(1-p)/n}} \le \frac{d}{\sqrt{p(1-p)/n}}\right)$$

$$\approx P\left(\frac{-d}{\sqrt{p(1-p)/n}} \le Z \le \frac{d}{\sqrt{p(1-p)/n}}\right) = 1 - \alpha$$

But $P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = 1 - \alpha$, so

$$\frac{d}{\sqrt{p(1-p)/n}} = z_{\alpha/2}$$

which implies that

$$n = \frac{z_{\alpha/2}^2\, p(1-p)}{d^2} \tag{5.3.4}$$

Equation 5.3.4 is not an acceptable final answer, though, because the right-hand
side is a function of p, the unknown parameter. But $p(1-p) \le \frac{1}{4}$ for $0 < p < 1$, so
the sample size

$$n = \frac{z_{\alpha/2}^2}{4d^2}$$

would necessarily cause $X/n$ to satisfy Equation 5.3.3, regardless of the actual
value of p. (Notice the connection between the statements of Theorem 5.3.2 and
Definition 5.3.1.)


A public health survey is being planned in a large metropolitan area for the purpose
of estimating the proportion of children, ages zero to fourteen, who are lacking
adequate polio immunization. Organizers of the project would like the sample
proportion of inadequately immunized children, $X/n$, to have at least a 98% probability
of being within 0.05 of the true proportion, p. How large should the sample be?

Here $100(1-\alpha) = 98$, so $\alpha = 0.02$ and $z_{\alpha/2} = 2.33$. By Theorem 5.3.2, then, the
smallest acceptable sample size is 543:

$$n = \frac{(2.33)^2}{4(0.05)^2} = 543$$
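The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `min_sample_size` and its parameter `p_bound` are hypothetical, and the critical value is passed in as the rounded table value 2.33 used in the text rather than computed to more decimals. With `p_bound` left at 1/2 the function evaluates the bound of Theorem 5.3.2; any other value simply plugs that number into Equation 5.3.4 in place of p.

```python
import math

def min_sample_size(z, d, p_bound=0.5):
    """Smallest integer n with n >= z^2 * p(1 - p) / d^2, evaluated at p = p_bound.

    p_bound = 0.5 gives the worst case p(1 - p) = 1/4, i.e. n >= z^2 / (4 d^2).
    """
    return math.ceil(z ** 2 * p_bound * (1 - p_bound) / d ** 2)

# Polio immunization survey: 98% probability of being within 0.05 of p,
# using the rounded critical value z = 2.33 from the text.
print(min_sample_size(z=2.33, d=0.05))  # prints 543
```

Rounding up with `math.ceil` reflects the requirement that n be "no smaller than" the computed bound.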


Comment Occasionally, there may be reason to believe that p is necessarily less
than some number $r_1$, where $r_1 < \frac{1}{2}$, or greater than some number $r_2$, where $r_2 > \frac{1}{2}$.
If so, the factor $p(1-p)$ in Equation 5.3.4 can be replaced by either $r_1(1-r_1)$ or
$r_2(1-r_2)$, and the sample size required to estimate p with a specified precision will
be reduced, perhaps by a considerable amount.

Suppose, for example, that previous immunization studies suggest that no more
than 20% of children between the ages of zero and fourteen are inadequately
immunized. The smallest sample size, then, for which

$$P\left(-0.05 \le \frac{X}{n} - p \le 0.05\right) = 0.98$$

is 348, an n that represents almost a 36% reduction $\left(= \frac{543-348}{543} \times 100\right)$ from the
original 543:

$$n = \frac{(2.33)^2}{(0.05)^2}(0.20)(0.80) = 348$$


Comment Theorems 5.3.1 and 5.3.2 are both based on the assumption that the
X in $X/n$ varies according to a binomial model. What we learned in Section 3.3,
though, seems to contradict that assumption: Samples used in opinion surveys
are invariably drawn without replacement, in which case X is hypergeometric, not
binomial. The consequences of that particular “error,” however, are easily corrected
and frequently negligible.

It can be shown mathematically that the expected value of $X/n$ is the same
regardless of whether X is binomial or hypergeometric; its variance, though, is
different. If X is binomial,

$$\text{Var}\left(\frac{X}{n}\right) = \frac{p(1-p)}{n}$$

If X is hypergeometric,

$$\text{Var}\left(\frac{X}{n}\right) = \frac{p(1-p)}{n}\left(\frac{N-n}{N-1}\right)$$

where N is the total number of subjects in the population.

Since $\frac{N-n}{N-1} < 1$, the actual variance of $X/n$ is somewhat smaller than the (binomial)
variance we have been assuming, $\frac{p(1-p)}{n}$. The ratio $\frac{N-n}{N-1}$ is called the finite correction
factor. If N is much larger than n, which is typically the case, then the magnitude
of $\frac{N-n}{N-1}$ will be so close to 1 that the variance of $X/n$ is equal to $\frac{p(1-p)}{n}$ for all practical
purposes. Thus the “binomial” assumption in those situations is more than adequate.
Only when the sample is a sizeable fraction of the population do we need to include
the finite correction factor in any calculations that involve the variance of $X/n$.
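To get a feel for when the finite correction factor matters, the two variance formulas can be compared numerically. The sketch below is illustrative only: the function names and the population sizes are assumptions chosen for the demonstration, not values from the text.

```python
def var_binomial(p, n):
    """Var(X/n) when X is binomial: p(1 - p)/n."""
    return p * (1 - p) / n

def finite_correction(N, n):
    """The finite correction factor (N - n)/(N - 1)."""
    return (N - n) / (N - 1)

def var_hypergeometric(p, n, N):
    """Var(X/n) when X is hypergeometric (sampling without replacement)."""
    return var_binomial(p, n) * finite_correction(N, n)

# Hypothetical populations of increasing size, each sampled with n = 500.
for N in (1_000, 100_000, 10_000_000):
    print(N, round(finite_correction(N, n=500), 6))
```

When the sample is half the population the factor is close to 1/2, but it moves toward 1 as N grows relative to n, which is the sense in which the binomial assumption is "more than adequate" for large populations.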


Questions 


5.3.1. A commonly used IQ test is scaled to have a mean
of 100 and a standard deviation of $\sigma = 15$. A school
counselor was curious about the average IQ of the students
in her school and took a random sample of fifty
students’ IQ scores. The average of these was $\bar{y} = 107.9$.
Find a 95% confidence interval for the student IQ in the
school.


5.3.2. The production of a nationally marketed detergent
results in certain workers receiving prolonged exposures
to a Bacillus subtilis enzyme. Nineteen workers
were tested to determine the effects of those exposures,
if any, on various respiratory functions. One such
function, air-flow rate, is measured by computing the
ratio of a person’s forced expiratory volume ($\text{FEV}_1$)
to his or her vital capacity (VC). (Vital capacity is
the maximum volume of air a person can exhale after
taking as deep a breath as possible; $\text{FEV}_1$ is the maximum
volume of air a person can exhale in one second.)
In persons with no lung dysfunction, the “norm”
for $\text{FEV}_1$/VC ratios is 0.80. Based on the following data
(164), is it believable that exposure to the Bacillus subtilis
enzyme has no effect on the $\text{FEV}_1$/VC ratio? Answer
the question by constructing a 95% confidence interval.
Assume that $\text{FEV}_1$/VC ratios are normally distributed with
$\sigma = 0.09$.


Subject   FEV1/VC     Subject   FEV1/VC
RH        0.61        WS        0.78
RB        0.70        RV        0.84
MB        0.63        EN        0.83
DM        0.76        WD        0.82
WB        0.67        FR        0.74
RB        0.72        PD        0.85
BF        0.64        EB        0.73
JT        0.82        PC        0.85
PS        0.88        RW        0.87
RB        0.82


5.3.3. Mercury pollution is widely recognized as a serious
ecological problem. Much of the mercury released into the
environment originates as a byproduct of coal burning and
other industrial processes. It does not become dangerous
until it falls into large bodies of water, where microorganisms
convert it to methylmercury ($\text{CH}_3^{203}$), an organic
form that is particularly toxic. Fish are the intermediaries:
They ingest and absorb the methylmercury and are then
eaten by humans. Men and women, however, may not
metabolize $\text{CH}_3^{203}$ at the same rate. In one study investigating
that issue, six women were given a known amount
of protein-bound methylmercury. Shown in the following
table are the half-lives of the methylmercury in their
systems (114). For men, the average $\text{CH}_3^{203}$ half-life is
believed to be eighty days. Assume that for both genders,
$\text{CH}_3^{203}$ half-lives are normally distributed with a standard
deviation ($\sigma$) of eight days. Construct a 95% confidence
interval for the true female $\text{CH}_3^{203}$ half-life. Based on these
data, is it believable that males and females metabolize
methylmercury at the same rate? Explain.


Females   CH3(203) Half-Life
AE        52
EH        69
LJ        73
AN        88
KR        87
LU        56

5.3.4. A physician who has a group of thirty-eight female 
patients aged 18 to 24 on a special diet wishes to estimate 
the effect of the diet on total serum cholesterol. For this 
group, their average serum cholesterol is 188.4 (measured 
in mg/100mL). Because of a large-scale government study, 
the physician is willing to assume that the total serum 
cholesterol measurements are normally distributed with 
standard deviation of $\sigma = 40.7$. Find a 95% confidence
interval of the mean serum cholesterol of patients on the 
special diet. Does the diet seem to have any effect on 
their serum cholesterol, given that the national average for 
women aged 18 to 24 is 192.0? 


5.3.5. Suppose a sample of size n is to be drawn from
a normal distribution where $\sigma$ is known to be 14.3. How
large does n have to be to guarantee that the length of the
95% confidence interval for $\mu$ will be less than 3.06?


5.3.6. What “confidence” would be associated with each
of the following intervals? Assume that the random variable
Y is normally distributed and that $\sigma$ is known.

(a) $\left(\bar{y} - 1.64 \cdot \frac{\sigma}{\sqrt{n}},\ \bar{y} + 2.33 \cdot \frac{\sigma}{\sqrt{n}}\right)$
(b) $\left(-\infty,\ \bar{y} + 2.58 \cdot \frac{\sigma}{\sqrt{n}}\right)$
(c) $\left(\bar{y} - 1.64 \cdot \frac{\sigma}{\sqrt{n}},\ \bar{y}\right)$


5.3.7. Five independent samples, each of size n, are to be
drawn from a normal distribution where $\sigma$ is known. For
each sample, the interval $\left(\bar{y} - 0.96 \cdot \frac{\sigma}{\sqrt{n}},\ \bar{y} + 1.06 \cdot \frac{\sigma}{\sqrt{n}}\right)$ will
be constructed. What is the probability that at least four
of the intervals will contain the unknown $\mu$?


5.3.8. Suppose that $y_1, y_2, \ldots, y_n$ is a random sample
of size n from a normal distribution where $\sigma$
is known. Depending on how the tail-area probabilities
are split up, an infinite number of random intervals
having a 95% probability of containing $\mu$ can be constructed.
What is unique about the particular interval
$\left(\bar{y} - 1.96 \cdot \frac{\sigma}{\sqrt{n}},\ \bar{y} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right)$?


5.3.9. If the standard deviation ($\sigma$) associated with the
pdf that produced the following sample is 3.6, would it be
correct to claim that

$$\left(2.61 - 1.96 \cdot \frac{3.6}{\sqrt{20}},\ 2.61 + 1.96 \cdot \frac{3.6}{\sqrt{20}}\right) = (1.03, 4.19)$$

is a 95% confidence interval for $\mu$? Explain.


235 0.1 0.2 1.3 
3.2 0.1 0.1 1.4 
0.5 0.2 0.4 
0.4 74 1.8 2.1 
0.3 8.6 0.3 


5.3.10. In 1927, the year he hit sixty home runs, Babe 
Ruth batted .356, having collected 192 hits in 540 official 
at-bats (140). Based on his performance that season, con- 
struct a 95% confidence interval for Ruth’s probability of 
getting a hit in a future at-bat. 


5.3.11. To buy a thirty-second commercial break dur- 
ing the telecast of Super Bowl XXIX cost approxi- 
mately $1,000,000. Not surprisingly, potential sponsors 
wanted to know how many people might be watch- 
ing. In a survey of 1015 potential viewers, 281 said 
they expected to see less than a quarter of the adver- 
tisements aired during the game. Define the rele- 
vant parameter and estimate it using a 90% confidence 
interval. 


5.3.12. During one of the first “beer wars” in the early 
1980s, a taste test between Schlitz and Budweiser was the 
focus of a nationally broadcast TV commercial. One hun- 
dred people agreed to drink from two unmarked mugs and 
indicate which of the two beers they liked better; fifty-four 
said, “Bud.” Construct and interpret the corresponding 
95% confidence interval for p, the true proportion of 
beer drinkers who preferred Budweiser to Schlitz. How
would Budweiser and Schlitz executives each have put 
these results in the best possible light for their respective 
companies? 


5.3.13. The Pew Research Center did a survey of 2253 
adults and discovered that 63% of them had broadband 
Internet connections in their homes. The survey report 
noted that this figure represented a “significant jump” 
from the similar figure of 54% from two years earlier. One 
way to define “significant jump” is to show that the earlier 
number does not lie in the 95% confidence interval. Was 
the increase significant by this definition? 


Source: http://www.pewinternet.org/Reports/2009/10-Home-Broadband-Adoption-2009.aspx.


5.3.14. If (0.57, 0.63) is a 50% confidence interval for
p, what does $k/n$ equal and how many observations were
taken?


5.3.15. Suppose a coin is to be tossed n times for the pur- 
pose of estimating p, where p= P(heads). How large must 
n be to guarantee that the length of the 99% confidence 
interval for p will be less than 0.02? 


5.3.16. On the morning of November 9, 1994—the day 
after the electoral landslide that had returned Republicans 
to power in both branches of Congress —several key races 
were still in doubt. The most prominent was the Washing- 
ton contest involving Democrat Tom Foley, the reigning 
speaker of the house. An Associated Press story showed 
how narrow the margin had become (120): 


With 99 percent of precincts reporting, Foley trailed 
Republican challenger George Nethercutt by just 
2,174 votes, or 50.6 percent to 49.4 percent. About 
14,000 absentee ballots remained uncounted, mak- 
ing the race too close to call. 


Let p = P(Absentee voter prefers Foley). How small 
could p have been and still have given Foley a 20% chance 
of overcoming Nethercutt’s lead and winning the election? 


5.3.17. Which of the following two intervals has 
the greater probability of containing the binomial 


parameter p? 

E aa [X/nyA=X]ny_ ae eat) 
n n n n 

x 


« (bs) 


5.3.18. Examine the first two derivatives of the function
$g(p) = p(1-p)$ to verify the claim on p. 305 that
$p(1-p) \le \frac{1}{4}$ for $0 < p < 1$.


5.3.19. The financial crisis of 2008 highlighted the issue 
of excessive compensation for business CEOs. In a Gallup 
poll in the summer of 2009, 998 adults were asked, “Do 
you favor or oppose the federal government taking steps 
to limit the pay of executives at major companies?”, with 
59% responding in favor. The report of the poll noted a 
margin of error of ±3 percentage points. Verify the margin
of error and construct a 95% confidence interval. 


Source: http://www.gallup.com/poll/120872/Americans-Favor-Gov-Action-Limit-Executive-Pay.aspx.


5.3.20. Viral infections contracted early during a 
woman’s pregnancy can be very harmful to the fetus. One 
study found a total of 86 deaths and birth defects among 
202 pregnancies complicated by a first-trimester German 
measles infection (45). Is it believable that the true pro- 
portion of abnormal births under similar circumstances 
could be as high as 50%? Answer the question by calculating
the margin of error for the sample proportion,
86/202. 


5.3.21. Rewrite Definition 5.3.1 to cover the case where 
a finite correction factor needs to be included (i.e., situa- 
tions where the sample size n is not negligible relative to 
the population size N). 


5.3.22. A public health official is planning for the supply 
of influenza vaccine needed for the upcoming flu season. 
She took a poll of 350 local citizens and found that only 
126 said they would be vaccinated. 


(a) Find the 90% confidence interval for the true pro- 
portion of people who plan to get the vaccine. 

(b) Find the confidence interval, including the finite cor- 
rection factor, assuming the town’s population is 
3000. 


5.3.23. Given that n observations will produce a binomial
parameter estimator, $X/n$, having a margin of error
equal to 0.06, how many observations are required for the
proportion to have a margin of error half that size?


5.3.24. Given that a political poll shows that 52% of the 
sample favors Candidate A, whereas 48% would vote for 
Candidate B, and given that the margin of error associated 
with the survey is 0.05, does it make sense to claim that the 
two candidates are tied? Explain. 

5.3.25. Assume that the binomial parameter p is to be
estimated with the function $X/n$, where X is the number
of successes in n independent trials. Which demands the
larger sample size: requiring that $X/n$ have a 96% probability
of being within 0.05 of p, or requiring that $X/n$ have a
92% probability of being within 0.04 of p?


5.3.26. Suppose that p is to be estimated by $X/n$ and we are
willing to assume that the true p will not be greater than
0.4. What is the smallest n for which $X/n$ will have a 99%
probability of being within 0.05 of p?


5.3.27. Let p denote the true proportion of college stu- 
dents who support the movement to colorize classic films. 
Let the random variable X denote the number of stu- 
dents (out of n) who prefer colorized versions to black and 
white. What is the smallest sample size for which the probability
is 80% that the difference between $X/n$ and p is less
than 0.02?


5.3.28. University officials are planning to audit 1586 new 
appointments to estimate the proportion p who have been 
incorrectly processed by the payroll department. 


(a) How large does the sample size need to be in order 
for $X/n$, the sample proportion, to have an 85% chance
of lying within 0.03 of p? 

(b) Past audits suggest that p will not be larger than 0.10. 
Using that information, recalculate the sample size 
asked for in part (a). 
5.4 Properties of Estimators


The method of maximum likelihood and the method of moments described in
Section 5.2 both use very reasonable criteria to identify estimators for unknown
parameters, yet the two do not always yield the same answer. For example, given
that $Y_1, Y_2, \ldots, Y_n$ is a random sample from the pdf $f_Y(y; \theta) = \frac{2y}{\theta^2}$, $0 \le y \le \theta$, the
maximum likelihood estimator for $\theta$ is $\hat{\theta} = Y_{\max}$ while the method of moments estimator
is $\hat{\theta} = \frac{3}{2}\bar{Y}$. (See Questions 5.2.12 and 5.2.15.) Implicit in those two formulas is an
obvious question: which should we use?

More generally, the fact that parameters have multiple estimators (actually,
an infinite number of $\hat{\theta}$’s can be found for any given $\theta$) requires that we investigate
the statistical properties associated with the estimation process. What qualities
should a “good” estimator have? Is it possible to find a “best” $\hat{\theta}$? These and other
questions relating to the theory of estimation will be addressed in the next several
sections.

To understand the mathematics of estimation, we must first keep in mind
that every estimator is a function of a set of random variables; that is,
$\hat{\theta} = h(Y_1, Y_2, \ldots, Y_n)$. As such, any $\hat{\theta}$, itself, is a random variable: It has a pdf, an
expected value, and a variance, all three of which play key roles in evaluating its
capabilities.

We will denote the pdf of an estimator (at some point u) with the symbol
$f_{\hat{\theta}}(u)$ or $p_{\hat{\theta}}(u)$, depending on whether $\hat{\theta}$ is a continuous or a discrete random variable.
Probability calculations involving $\hat{\theta}$ will reduce to integrals of $f_{\hat{\theta}}(u)$ (if $\hat{\theta}$ is
continuous) or sums of $p_{\hat{\theta}}(u)$ (if $\hat{\theta}$ is discrete).


Example 5.4.1

a. Suppose a coin, for which p = P(heads) is unknown, is to be tossed ten times for
the purpose of estimating p with the function $\hat{p} = \frac{X}{10}$, where X is the observed
number of heads. If p = 0.60, what is the probability that $\left|\frac{X}{10} - 0.60\right| \le 0.10$?
That is, what are the chances that the estimator will fall within 0.10 of the true
value of the parameter? Here $\hat{p}$ is discrete: the only values $\frac{X}{10}$ can take on are
$\frac{0}{10}, \frac{1}{10}, \ldots, \frac{10}{10}$. Moreover, when p = 0.60,

$$p_{\hat{p}}\left(\frac{k}{10}\right) = P\left(\hat{p} = \frac{k}{10}\right) = P(X = k) = \binom{10}{k}(0.60)^k(0.40)^{10-k}, \quad k = 0, 1, \ldots, 10$$

Therefore,

$$\begin{aligned}
P\left(\left|\frac{X}{10} - 0.60\right| \le 0.10\right) &= P\left(0.60 - 0.10 \le \frac{X}{10} \le 0.60 + 0.10\right) \\
&= P(5 \le X \le 7) \\
&= \sum_{k=5}^{7} \binom{10}{k}(0.60)^k(0.40)^{10-k} \\
&= 0.6665
\end{aligned}$$

b. How likely is the estimator $\frac{X}{n}$ to lie within 0.10 of p if the coin in part (a) is
tossed one hundred times? Given that n is so large, a Z transformation can be
used to approximate the variation in $\frac{X}{100}$. Since $E\left(\frac{X}{n}\right) = p$ and $\text{Var}\left(\frac{X}{n}\right) = p(1-p)/n$,
we can write

$$\begin{aligned}
P\left(\left|\frac{X}{100} - 0.60\right| \le 0.10\right) &= P\left(0.50 \le \frac{X}{100} \le 0.70\right) \\
&= P\left(\frac{0.50 - 0.60}{\sqrt{\frac{(0.60)(0.40)}{100}}} \le \frac{X/100 - 0.60}{\sqrt{\frac{(0.60)(0.40)}{100}}} \le \frac{0.70 - 0.60}{\sqrt{\frac{(0.60)(0.40)}{100}}}\right) \\
&\approx P(-2.04 \le Z \le 2.04) \\
&= 0.9586
\end{aligned}$$

[Figure 5.4.1: the distribution of $\frac{X}{10}$ and the (approximate) distribution of $\frac{X}{100}$ when p = 0.60, plotted against values of X/n from 0 to 1; the area between 0.50 and 0.70 is 0.6665 for $\frac{X}{10}$ and 0.9586 for $\frac{X}{100}$.]

Figure 5.4.1 shows the two probabilities just calculated as areas under the probability
functions describing $\frac{X}{10}$ and $\frac{X}{100}$. As we would expect, the larger sample size
produces a more precise estimator: with n = 10, $\frac{X}{10}$ has only a 67% chance of lying in
the range from 0.50 to 0.70; for n = 100, though, the probability of $\frac{X}{100}$ falling within
0.10 of the true p (= 0.60) increases to 96%.

Are the additional ninety observations worth the gain in precision that we see in
Figure 5.4.1? Maybe yes and maybe no. In general, the answer to that sort of question
depends on two factors: (1) the cost of taking additional measurements, and
(2) the cost of making bad decisions or inappropriate inferences because of inaccurate
estimates. In practice, both costs, especially the latter, can be very difficult to
quantify.
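The two probabilities in Example 5.4.1 can be reproduced with a short script. This is a sketch rather than part of the original example: the function names are hypothetical, and the small tolerance `eps` is an implementation detail added here to guard against floating-point round-off when comparing k/n with the endpoints.

```python
from math import comb, sqrt
from statistics import NormalDist

def exact_prob_within(n, p, d, eps=1e-9):
    """P(|X/n - p| <= d) for X ~ binomial(n, p), by summing the binomial pdf."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k / n - p) <= d + eps
    )

def normal_approx_within(n, p, d):
    """Z-transformation approximation to P(|X/n - p| <= d)."""
    z = d / sqrt(p * (1 - p) / n)
    return 2 * NormalDist().cdf(z) - 1

print(round(exact_prob_within(10, 0.60, 0.10), 4))      # part (a): 0.6665
print(round(normal_approx_within(100, 0.60, 0.10), 3))  # part (b): about 0.959
```

The second value differs slightly in the fourth decimal place from the 0.9586 in the text because the text rounds the Z ratio to 2.04 before consulting the normal table.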


Unbiasedness 


Because they are random variables, estimators will take on different values from
sample to sample. Typically, some samples will yield $\theta_e$’s that underestimate $\theta$ while
others will lead to $\theta_e$’s that are numerically too large. Intuitively, we would like the
underestimates to somehow “balance out” the overestimates; that is, $\hat{\theta}$ should not
systematically err in any one particular direction.

Figure 5.4.2 shows the pdfs for two estimators, $\hat{\theta}_1$ and $\hat{\theta}_2$. Common sense tells us
that $\hat{\theta}_1$ is the better of the two because $f_{\hat{\theta}_1}(u)$ is centered with respect to the true $\theta$;
$\hat{\theta}_2$, on the other hand, will tend to give estimates that are too large because the bulk
of $f_{\hat{\theta}_2}(u)$ lies to the right of the true $\theta$.

[Figure 5.4.2: the pdfs $f_{\hat{\theta}_1}(u)$ and $f_{\hat{\theta}_2}(u)$, each drawn relative to the true $\theta$.]


Definition 5.4.1. Suppose that $Y_1, Y_2, \ldots, Y_n$ is a random sample from the
continuous pdf $f_Y(y; \theta)$, where $\theta$ is an unknown parameter. An estimator
$\hat{\theta}$ $[= h(Y_1, Y_2, \ldots, Y_n)]$ is said to be unbiased (for $\theta$) if $E(\hat{\theta}) = \theta$ for all $\theta$. [The
same concept and terminology apply if the data consist of a random sample
$X_1, X_2, \ldots, X_n$ drawn from a discrete pdf $p_X(k; \theta)$.]


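Example 5.4.2

It was mentioned at the outset of this section that $\hat{\theta}_1 = \frac{3}{2}\bar{Y}$ and $\hat{\theta}_2 = Y_{\max}$ are two
estimators for $\theta$ in the pdf $f_Y(y; \theta) = \frac{2y}{\theta^2}$, $0 \le y \le \theta$. Are either or both unbiased?

First we need E(Y), which is $\int_0^\theta y \cdot \frac{2y}{\theta^2}\,dy = \frac{2}{3}\theta$. Then using the properties of
expected values, we can show that $\hat{\theta}_1$ is unbiased for all $\theta$:

$$E(\hat{\theta}_1) = E\left(\frac{3}{2}\bar{Y}\right) = \frac{3}{2}E(\bar{Y}) = \frac{3}{2}E(Y) = \frac{3}{2} \cdot \frac{2}{3}\theta = \theta$$

The maximum likelihood estimator, on the other hand, is obviously biased:
since $Y_{\max}$ is necessarily less than or equal to $\theta$, its pdf will not be centered with
respect to $\theta$, and $E(Y_{\max})$ will be less than $\theta$. The exact factor by which $Y_{\max}$ tends to
underestimate $\theta$ is readily calculated. Recall from Theorem 3.10.1 that

$$f_{Y_{\max}}(y) = n F_Y(y)^{n-1} f_Y(y)$$

The cdf for Y is

$$F_Y(y) = \int_0^y \frac{2t}{\theta^2}\,dt = \frac{y^2}{\theta^2}$$

Then

$$f_{Y_{\max}}(y) = n\left(\frac{y^2}{\theta^2}\right)^{n-1} \cdot \frac{2y}{\theta^2} = \frac{2n}{\theta^{2n}}\,y^{2n-1}, \quad 0 \le y \le \theta$$

Therefore,

$$E(Y_{\max}) = \int_0^\theta y \cdot \frac{2n}{\theta^{2n}}\,y^{2n-1}\,dy = \frac{2n}{\theta^{2n}}\int_0^\theta y^{2n}\,dy = \frac{2n}{\theta^{2n}} \cdot \frac{\theta^{2n+1}}{2n+1} = \frac{2n}{2n+1}\theta$$

Notice that $\lim_{n \to \infty} \frac{2n}{2n+1}\theta = \theta$. Intuitively, this decrease in the bias makes sense because $f_{Y_{\max}}$ becomes
increasingly concentrated around $\theta$ as n grows.


Comment For any finite n, we can construct an estimator based on $Y_{\max}$ that is
unbiased. Let $\hat{\theta}_3 = \frac{2n+1}{2n} \cdot Y_{\max}$. Then

$$E(\hat{\theta}_3) = E\left(\frac{2n+1}{2n} \cdot Y_{\max}\right) = \frac{2n+1}{2n}E(Y_{\max}) = \frac{2n+1}{2n} \cdot \frac{2n}{2n+1}\theta = \theta$$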
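A small simulation can make the bias of $Y_{\max}$ visible. The sketch below is an illustration under stated assumptions, not part of the original example: the function name, the choice $\theta = 2$, $n = 5$, the number of replications, and the seed are all arbitrary. Observations are generated by inverting the cdf $F_Y(y) = y^2/\theta^2$, which gives $Y = \theta\sqrt{U}$ for U uniform on (0, 1).

```python
import random

def simulate_means(theta, n, reps, seed=0):
    """Average (3/2)*Ybar, Ymax, and ((2n+1)/(2n))*Ymax over many samples
    drawn from f_Y(y; theta) = 2y/theta^2, 0 <= y <= theta."""
    rng = random.Random(seed)
    sum_mom = sum_max = sum_adj = 0.0
    for _ in range(reps):
        # Inverse-cdf sampling: F_Y(y) = y^2/theta^2, so Y = theta * sqrt(U).
        ys = [theta * rng.random() ** 0.5 for _ in range(n)]
        y_max = max(ys)
        sum_mom += 1.5 * sum(ys) / n
        sum_max += y_max
        sum_adj += (2 * n + 1) / (2 * n) * y_max
    return sum_mom / reps, sum_max / reps, sum_adj / reps

mom, mle, adjusted = simulate_means(theta=2.0, n=5, reps=20_000)
print(mom, mle, adjusted)
```

With these settings the first and third averages should land near $\theta = 2$, while the average of $Y_{\max}$ should land near $\frac{2n}{2n+1}\theta = \frac{10}{11} \cdot 2 \approx 1.82$; the exact printed values depend on the random number stream.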
Example 5.4.3

Let $X_1, X_2, \ldots, X_n$ be a random sample from a discrete pdf $p_X(k; \theta)$, where $\theta = E(X)$
is an unknown parameter. Consider the estimator

$$\hat{\theta} = \sum_{i=1}^{n} a_i X_i$$

where the $a_i$’s are constants. For what values of $a_1, a_2, \ldots, a_n$ will $\hat{\theta}$ be unbiased?
By assumption, $\theta = E(X)$, so

$$E(\hat{\theta}) = E\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i E(X_i) = \sum_{i=1}^{n} a_i \theta = \theta \sum_{i=1}^{n} a_i$$

Clearly, $\hat{\theta}$ will be unbiased for any set of $a_i$’s for which $\sum_{i=1}^{n} a_i = 1$.
Example 5.4.4

Given a random sample $Y_1, Y_2, \ldots, Y_n$ from a normal distribution whose parameters
$\mu$ and $\sigma^2$ are both unknown, the maximum likelihood estimator for $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$

(recall Example 5.2.4). Is $\hat{\sigma}^2$ unbiased for $\sigma^2$? If not, what function of $\hat{\sigma}^2$ does have
an expected value equal to $\sigma^2$?

Notice, first, from Theorem 3.6.1 that for any random variable Y, $\text{Var}(Y) =
E(Y^2) - [E(Y)]^2$. Also, from Section 3.9, for any average, $\bar{Y}$, of a sample of n random
variables, $Y_1, Y_2, \ldots, Y_n$, $E(\bar{Y}) = E(Y_i)$ and $\text{Var}(\bar{Y}) = (1/n)\text{Var}(Y_i)$. Using those results,
we can write

$$\begin{aligned}
E(\hat{\sigma}^2) &= E\left[\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2\right] \\
&= E\left[\frac{1}{n}\sum_{i=1}^{n}\left(Y_i^2 - 2Y_i\bar{Y} + \bar{Y}^2\right)\right] \\
&= E\left[\frac{1}{n}\left(\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2\right)\right] \\
&= \frac{1}{n}\left[\sum_{i=1}^{n} E(Y_i^2) - nE(\bar{Y}^2)\right] \\
&= \frac{1}{n}\left[\sum_{i=1}^{n}(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right)\right] \\
&= \frac{n-1}{n}\sigma^2
\end{aligned}$$

Since the latter is not equal to $\sigma^2$, $\hat{\sigma}^2$ is biased.

To “unbias” the maximum likelihood estimator in this case, we need simply multiply
$\hat{\sigma}^2$ by $\frac{n}{n-1}$. By convention, the unbiased version of the maximum likelihood
estimator for $\sigma^2$ in a normal distribution is denoted $S^2$ and is referred to as the
sample variance:

$$S^2 = \text{sample variance} = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$


Comment The square root of the sample variance is called the sample standard
deviation:

$$S = \text{sample standard deviation} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$

In practice, S is the most commonly used estimator for $\sigma$ even though $E(S) \ne \sigma$
[despite the fact that $E(S^2) = \sigma^2$].
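The factor $\frac{n-1}{n}$ derived in Example 5.4.4 can be checked empirically. The following sketch is illustrative only; the function name, the parameter values ($\mu = 10$, $\sigma = 2$, $n = 5$), the replication count, and the seed are assumptions made for the demonstration.

```python
import random

def average_variance_estimates(mu, sigma, n, reps, seed=0):
    """Average the MLE (divide by n) and S^2 (divide by n - 1) over many
    normal samples of size n."""
    rng = random.Random(seed)
    total_mle = total_s2 = 0.0
    for _ in range(reps):
        ys = [rng.gauss(mu, sigma) for _ in range(n)]
        y_bar = sum(ys) / n
        ss = sum((y - y_bar) ** 2 for y in ys)  # sum of squared deviations
        total_mle += ss / n
        total_s2 += ss / (n - 1)
    return total_mle / reps, total_s2 / reps

avg_mle, avg_s2 = average_variance_estimates(mu=10.0, sigma=2.0, n=5, reps=20_000)
print(avg_mle, avg_s2)
```

Here $\sigma^2 = 4$, so the average of the maximum likelihood estimates should fall near $\frac{4}{5} \cdot 4 = 3.2$ and the average of the $S^2$ values near 4; the exact printed values depend on the random number stream.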


Questions 


5.4.1. Two chips are drawn without replacement from an
urn containing five chips, numbered 1 through 5. The average
of the two drawn is to be used as an estimator, $\hat{\theta}$,
for the true average of all the chips ($\theta = 3$). Calculate
$P(|\hat{\theta} - 3| > 1.0)$.


5.4.2. Suppose a random sample of size n = 6 is drawn
from the uniform pdf $f_Y(y; \theta) = 1/\theta$, $0 \le y \le \theta$, for the
purpose of using $\hat{\theta} = Y_{\max}$ to estimate $\theta$.


(a) Calculate the probability that $\hat{\theta}$ falls within 0.2 of $\theta$
given that the parameter’s true value is 3.0.

(b) Calculate the probability of the event asked for in
part (a), assuming the sample size is 3 instead of 6.


5.4.3. Five hundred adults are asked whether they favor 
a bipartisan campaign finance reform bill. If the true pro- 
portion of the electorate in favor of the legislation is 52%, 
what are the chances that fewer than half of those in the 
sample support the proposal? Use a Z transformation to 
approximate the answer. 


5.4.4. A sample of size n = 16 is drawn from a normal distribution
where $\sigma = 10$ but $\mu$ is unknown. If $\mu = 20$, what
is the probability that the estimator $\hat{\mu} = \bar{Y}$ will lie between
19.0 and 21.0?


5.4.5. Suppose $X_1, X_2, \ldots, X_n$ is a random sample of size n
drawn from a Poisson pdf where $\lambda$ is an unknown parameter.
Show that $\hat{\lambda} = \bar{X}$ is unbiased for $\lambda$. For what type of
parameter, in general, will the sample mean necessarily be
an unbiased estimator? (Hint: The answer is implicit in the
derivation showing that $\bar{X}$ is unbiased for the Poisson $\lambda$.)


5.4.6. Let $Y_{\min}$ be the smallest order statistic in a random
sample of size n drawn from the uniform pdf, $f_Y(y; \theta) =
1/\theta$, $0 \le y \le \theta$. Find an unbiased estimator for $\theta$ based on
$Y_{\min}$.


5.4.7. Let Y be the random variable described in
Example 5.2.3, where $f_Y(y; \theta) = e^{-(y-\theta)}$, $y \ge \theta$, $\theta > 0$. Show
that $Y_{\min} - \frac{1}{n}$ is an unbiased estimator of $\theta$.


5.4.8. Suppose that 14, 10, 18, and 21 constitute a random
sample of size 4 drawn from a uniform pdf defined over
the interval $[0, \theta]$, where $\theta$ is unknown. Find an unbiased
estimator for $\theta$ based on $Y_3'$, the third order statistic. What
numerical value does the estimator have for these particular
observations? Is it possible that we would know that
an estimate for $\theta$ based on $Y_3'$ was incorrect, even if we had
no idea what the true value of $\theta$ might be? Explain.


5.4.9. A random sample of size 2, $Y_1$ and $Y_2$, is drawn from
the pdf

$$f_Y(y; \theta) = 2y\theta^2, \quad 0 \le y \le \frac{1}{\theta}$$

What must c equal if the statistic $c(Y_1 + 2Y_2)$ is to be an
unbiased estimator for $\frac{1}{\theta}$?


5.4.10. A sample of size 1 is drawn from the uniform pdf
defined over the interval $[0, \theta]$. Find an unbiased estimator
for $\theta^2$. (Hint: Is $\hat{\theta} = Y^2$ unbiased?)


5.4.11. Suppose that W is an unbiased estimator for $\theta$.
Can $W^2$ be an unbiased estimator for $\theta^2$?


5.4.12. We showed in Example 5.4.4 that $\hat{\sigma}^2 =
\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$ is biased for $\sigma^2$. Suppose $\mu$ is known and does
not have to be estimated by $\bar{Y}$. Show that $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \mu)^2$
is unbiased for $\sigma^2$.


5.4.13. As an alternative to imposing unbiasedness, an
estimator’s distribution can be “centered” by requiring
that its median be equal to the unknown parameter $\theta$. If it
is, $\hat{\theta}$ is said to be median unbiased. Let $Y_1, Y_2, \ldots, Y_n$ be a
random sample of size n from the uniform pdf, $f_Y(y; \theta) =
1/\theta$, $0 \le y \le \theta$. For arbitrary n, is $\hat{\theta} = \frac{n+1}{n} \cdot Y_{\max}$ median
unbiased? Is it median unbiased for any value of n?


5.4.14. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size n
from the pdf $f_Y(y; \theta) = \frac{1}{\theta}e^{-y/\theta}$, $y > 0$. Let $\hat{\theta} = n \cdot Y_{\min}$. Is $\hat{\theta}$
unbiased for $\theta$? Is $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ unbiased for $\theta$?


5.4.15. An estimator $\hat{\theta}_n = h(W_1, \ldots, W_n)$ is said to be
asymptotically unbiased for $\theta$ if $\lim_{n \to \infty} E(\hat{\theta}_n) = \theta$. Suppose W
is a random variable with $E(W) = \mu$ and with variance
$\sigma^2$. Show that $\bar{W}^2$ is an asymptotically unbiased estimator
for $\mu^2$.


5.4.16. Is the maximum likelihood estimator for $\sigma^2$ in a
normal pdf, where both $\mu$ and $\sigma^2$ are unknown, asymptotically
unbiased?


Efficiency


As we have seen, unknown parameters can have a multiplicity of unbiased estimators.
For samples drawn from the uniform pdf, $f_Y(y; \theta) = 1/\theta$, $0 \le y \le \theta$, for example,
both $\hat{\theta} = \frac{n+1}{n} \cdot Y_{\max}$ and $\hat{\theta} = \frac{2}{n}\sum_{i=1}^{n} Y_i$ have expected values equal to $\theta$. Does it matter
which we choose?
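Yes. Unbiasedness is not the only property we would like an estimator to have;
also important is its precision. Figure 5.4.3 shows the pdfs associated with two
hypothetical estimators, $\hat{\theta}_1$ and $\hat{\theta}_2$. Both are unbiased for $\theta$, but $\hat{\theta}_2$ is clearly the better of
the two because of its smaller variance. For any value r,

$$P(\theta - r \le \hat{\theta}_2 \le \theta + r) > P(\theta - r \le \hat{\theta}_1 \le \theta + r)$$

That is, $\hat{\theta}_2$ has a greater chance of being within a distance r of the unknown $\theta$ than
does $\hat{\theta}_1$.


Definition 5.4.2. Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two unbiased estimators for a parameter $\theta$. If

$$\text{Var}(\hat{\theta}_1) < \text{Var}(\hat{\theta}_2)$$

we say that $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$. Also, the relative efficiency of $\hat{\theta}_1$ with
respect to $\hat{\theta}_2$ is the ratio $\text{Var}(\hat{\theta}_2)/\text{Var}(\hat{\theta}_1)$.


[Figure 5.4.3: the pdfs $f_{\hat{\theta}_1}(u)$ and $f_{\hat{\theta}_2}(u)$ of two unbiased estimators, with the areas $P(|\hat{\theta}_1 - \theta| \le r)$ and $P(|\hat{\theta}_2 - \theta| \le r)$ marked.]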
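Example 5.4.5

Let $Y_1$, $Y_2$, and $Y_3$ be a random sample from a normal distribution where both
$\mu$ and $\sigma$ are unknown. Which of the following is a more efficient estimator
for $\mu$?

$$\hat{\mu}_1 = \frac{1}{4}Y_1 + \frac{1}{2}Y_2 + \frac{1}{4}Y_3$$

or

$$\hat{\mu}_2 = \frac{1}{3}Y_1 + \frac{1}{3}Y_2 + \frac{1}{3}Y_3$$

Notice, first, that both $\hat{\mu}_1$ and $\hat{\mu}_2$ are unbiased for $\mu$:

$$\begin{aligned}
E(\hat{\mu}_1) &= E\left(\frac{1}{4}Y_1 + \frac{1}{2}Y_2 + \frac{1}{4}Y_3\right) \\
&= \frac{1}{4}E(Y_1) + \frac{1}{2}E(Y_2) + \frac{1}{4}E(Y_3) \\
&= \frac{1}{4}\mu + \frac{1}{2}\mu + \frac{1}{4}\mu \\
&= \mu
\end{aligned}$$

and

$$\begin{aligned}
E(\hat{\mu}_2) &= E\left(\frac{1}{3}Y_1 + \frac{1}{3}Y_2 + \frac{1}{3}Y_3\right) \\
&= \frac{1}{3}E(Y_1) + \frac{1}{3}E(Y_2) + \frac{1}{3}E(Y_3) \\
&= \frac{1}{3}\mu + \frac{1}{3}\mu + \frac{1}{3}\mu \\
&= \mu
\end{aligned}$$

But $\text{Var}(\hat{\mu}_2) < \text{Var}(\hat{\mu}_1)$, so $\hat{\mu}_2$ is the more efficient of the two:

$$\begin{aligned}
\text{Var}(\hat{\mu}_1) &= \text{Var}\left(\frac{1}{4}Y_1 + \frac{1}{2}Y_2 + \frac{1}{4}Y_3\right) \\
&= \frac{1}{16}\text{Var}(Y_1) + \frac{1}{4}\text{Var}(Y_2) + \frac{1}{16}\text{Var}(Y_3) \\
&= \frac{3\sigma^2}{8}
\end{aligned}$$

$$\begin{aligned}
\text{Var}(\hat{\mu}_2) &= \text{Var}\left(\frac{1}{3}Y_1 + \frac{1}{3}Y_2 + \frac{1}{3}Y_3\right) \\
&= \frac{1}{9}\text{Var}(Y_1) + \frac{1}{9}\text{Var}(Y_2) + \frac{1}{9}\text{Var}(Y_3) \\
&= \frac{3\sigma^2}{9}
\end{aligned}$$

(The relative efficiency of $\hat{\mu}_2$ to $\hat{\mu}_1$ is

$$\frac{3\sigma^2}{8} \bigg/ \frac{3\sigma^2}{9}$$

or 1.125.)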
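The arithmetic in Example 5.4.5 generalizes to any set of weights: for independent observations with a common variance $\sigma^2$, the variance of $\sum a_i Y_i$ is $\sigma^2 \sum a_i^2$. The sketch below, with a hypothetical helper named `variance_factor`, computes that multiplier exactly using rational arithmetic.

```python
from fractions import Fraction

def variance_factor(weights):
    """Var(sum a_i * Y_i) / sigma^2 for independent Y_i with common variance."""
    return sum(w * w for w in weights)

w1 = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]  # weights of mu-hat-1
w2 = [Fraction(1, 3)] * 3                              # weights of mu-hat-2

# Both sets of weights sum to 1, so both estimators are unbiased for mu.
assert sum(w1) == 1 and sum(w2) == 1

v1, v2 = variance_factor(w1), variance_factor(w2)
print(v1, v2, v1 / v2)  # prints 3/8 1/3 9/8
```

The ratio 9/8 is the 1.125 reported in the example.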
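Example 5.4.6

Let $Y_1, \ldots, Y_n$ be a random sample from the pdf $f_Y(y; \theta) = \frac{2y}{\theta^2}$, $0 \le y \le \theta$. We know
from Example 5.4.2 that $\hat{\theta}_1 = \frac{3}{2}\bar{Y}$ and $\hat{\theta}_2 = \frac{2n+1}{2n}Y_{\max}$ are both unbiased for $\theta$. Which
estimator is more efficient?

First, let us calculate the variance of $\hat{\theta}_1 = \frac{3}{2}\bar{Y}$. To do so, we need the variance of
Y. To that end, note that

$$E(Y^2) = \int_0^\theta y^2 \cdot \frac{2y}{\theta^2}\,dy = \frac{2}{\theta^2}\int_0^\theta y^3\,dy = \frac{2}{\theta^2} \cdot \frac{\theta^4}{4} = \frac{1}{2}\theta^2$$

and

$$\text{Var}(Y) = E(Y^2) - E(Y)^2 = \frac{1}{2}\theta^2 - \left(\frac{2}{3}\theta\right)^2 = \frac{\theta^2}{18}$$

Then

$$\text{Var}(\hat{\theta}_1) = \text{Var}\left(\frac{3}{2}\bar{Y}\right) = \frac{9}{4}\text{Var}(\bar{Y}) = \frac{9}{4} \cdot \frac{\text{Var}(Y)}{n} = \frac{9}{4n} \cdot \frac{\theta^2}{18} = \frac{\theta^2}{8n}$$

To address the variance of $\hat{\theta}_2 = \frac{2n+1}{2n}Y_{\max}$, we start with finding the variance of
$Y_{\max}$. Recall that its pdf is

$$n F_Y(y)^{n-1} f_Y(y) = \frac{2n}{\theta^{2n}}\,y^{2n-1}, \quad 0 \le y \le \theta$$

From that expression, we obtain

$$E(Y_{\max}^2) = \int_0^\theta y^2 \cdot \frac{2n}{\theta^{2n}}\,y^{2n-1}\,dy = \frac{2n}{\theta^{2n}}\int_0^\theta y^{2n+1}\,dy = \frac{2n}{\theta^{2n}} \cdot \frac{\theta^{2n+2}}{2n+2} = \frac{n}{n+1}\theta^2$$

and then

$$\text{Var}(Y_{\max}) = E(Y_{\max}^2) - E(Y_{\max})^2 = \frac{n}{n+1}\theta^2 - \left(\frac{2n}{2n+1}\theta\right)^2 = \frac{n}{(n+1)(2n+1)^2}\theta^2$$

Finally,

$$\text{Var}(\hat{\theta}_2) = \text{Var}\left(\frac{2n+1}{2n}Y_{\max}\right) = \frac{(2n+1)^2}{4n^2}\text{Var}(Y_{\max}) = \frac{(2n+1)^2}{4n^2} \cdot \frac{n}{(n+1)(2n+1)^2}\theta^2 = \frac{1}{4n(n+1)}\theta^2$$

Note that $\text{Var}(\hat{\theta}_2) = \frac{1}{4n(n+1)}\theta^2 < \frac{1}{8n}\theta^2 = \text{Var}(\hat{\theta}_1)$ for n > 1, so we say that $\hat{\theta}_2$ is more
efficient than $\hat{\theta}_1$. The relative efficiency of $\hat{\theta}_2$ with respect to $\hat{\theta}_1$ is the ratio of their
variances:

$$\frac{\text{Var}(\hat{\theta}_1)}{\text{Var}(\hat{\theta}_2)} = \frac{1}{8n}\theta^2 \bigg/ \frac{1}{4n(n+1)}\theta^2 = \frac{4n(n+1)}{8n} = \frac{n+1}{2}$$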
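As a check on the algebra in Example 5.4.6, the moments derived above can be combined in exact rational arithmetic. This is only a sketch: the function names are hypothetical, and each function returns the variance divided by $\theta^2$, so $\theta$ itself never needs a numerical value.

```python
from fractions import Fraction

def var_theta1_over_theta_sq(n):
    """Var((3/2)*Ybar)/theta^2, built from E(Y) = 2/3 and E(Y^2) = 1/2 (in units of theta)."""
    var_y = Fraction(1, 2) - Fraction(2, 3) ** 2      # = 1/18
    return Fraction(9, 4) * var_y / n

def var_theta2_over_theta_sq(n):
    """Var(((2n+1)/(2n))*Ymax)/theta^2, built from the moments of Ymax."""
    var_ymax = Fraction(n, n + 1) - Fraction(2 * n, 2 * n + 1) ** 2
    return Fraction(2 * n + 1, 2 * n) ** 2 * var_ymax

for n in (1, 2, 5, 10):
    v1 = var_theta1_over_theta_sq(n)
    v2 = var_theta2_over_theta_sq(n)
    print(n, v1, v2, v1 / v2)
```

For every n the last column equals $(n+1)/2$, so the advantage of the estimator based on $Y_{\max}$ grows linearly with the sample size; at n = 1 the two variances coincide.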
Questions


5.4.17. Let $X_1, X_2, \ldots, X_n$ denote the outcomes of a series
of n independent trials, where

$$X_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$

for $i = 1, 2, \ldots, n$. Let $X = X_1 + X_2 + \cdots + X_n$.

(a) Show that $\hat{p}_1 = X_1$ and $\hat{p}_2 = \frac{X}{n}$ are unbiased estimators
for p.

(b) Intuitively, $\hat{p}_2$ is a better estimator than $\hat{p}_1$ because
$\hat{p}_1$ fails to include any of the information about the
parameter contained in trials 2 through n. Verify that
speculation by comparing the variances of $\hat{p}_1$ and $\hat{p}_2$.


5.4.18. Suppose that n = 5 observations are taken from
the uniform pdf, $f_Y(y; \theta) = 1/\theta$, $0 \le y \le \theta$, where $\theta$ is
unknown. Two unbiased estimators for $\theta$ are

$$\hat{\theta}_1 = \frac{6}{5} \cdot Y_{\max} \quad \text{and} \quad \hat{\theta}_2 = 6 \cdot Y_{\min}$$

Which estimator would be better to use? [Hint: What must
be true of $\text{Var}(Y_{\max})$ and $\text{Var}(Y_{\min})$ given that $f_Y(y; \theta)$ is
symmetric?] Does your answer as to which estimator is
better make sense on intuitive grounds? Explain.


5.4.19. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size n
from the pdf $f_Y(y; \theta) = \frac{1}{\theta}e^{-y/\theta}$, $y > 0$.

(a) Show that $\hat{\theta}_1 = Y_1$, $\hat{\theta}_2 = \bar{Y}$, and $\hat{\theta}_3 = n \cdot Y_{\min}$ are all
unbiased estimators for $\theta$.
(b) Find the variances of $\hat{\theta}_1$, $\hat{\theta}_2$, and $\hat{\theta}_3$.
(c) Calculate the relative efficiencies of $\hat{\theta}_1$ to $\hat{\theta}_3$ and $\hat{\theta}_2$
to $\hat{\theta}_3$.


5.4.20. Given a random sample of size n from a Poisson
distribution, $\hat{\lambda}_1 = X_1$ and $\hat{\lambda}_2 = \bar{X}$ are two unbiased
estimators for $\lambda$. Calculate the relative efficiency of $\hat{\lambda}_1$
to $\hat{\lambda}_2$.


5.4.21. If $Y_1, Y_2, \ldots, Y_n$ are random observations from a
uniform pdf over $[0, \theta]$, both $\hat{\theta}_1 = \left(\frac{n+1}{n}\right) \cdot Y_{\max}$ and $\hat{\theta}_2 =
(n+1) \cdot Y_{\min}$ are unbiased estimators for $\theta$. Show that
$\text{Var}(\hat{\theta}_2)/\text{Var}(\hat{\theta}_1) = n^2$.


5.4.22. Suppose that $W_1$ is a random variable with mean
$\mu$ and variance $\sigma_1^2$ and $W_2$ is a random variable with
mean $\mu$ and variance $\sigma_2^2$. From Example 5.4.3, we know
that $cW_1 + (1-c)W_2$ is an unbiased estimator of $\mu$ for
any constant c > 0. If $W_1$ and $W_2$ are independent, for
what value of c is the estimator $cW_1 + (1-c)W_2$ most
efficient?


Theorem
5.5.1

5.5 Minimum-Variance Estimators: The Cramér-Rao 
Lower Bound 


Given two estimators, 6, and 6), each unbiased for the parameter 6, we know from 
Section 5.4 which is “better” —the one with the smaller variance. But nothing in that 
section speaks to the more fundamental question of how good 6, and 4) are relative 
to the infinitely many other unbiased estimators for 0. Is there a 63, for example, 
that has a smaller variance than either 6 or 6 has? Can we identify the unbiased 
estimator having the smallest variance? Addressing those concerns is one of the 
most elegant, yet practical, theorems in all of mathematical statistics, a result known 
as the Cramér-Rao lower bound. 

Suppose a random sample of size n is taken from, say, a continuous probability 
distribution fy(y; 0), where 6 is an unknown parameter. Associated with fy(y; 6) is 
a theoretical limit below which the variance of any unbiased estimator for 6 cannot 
fall. That limit is the Cramér-Rao lower bound. If the variance of a given 6 is equal 
to the Cramér-Rao lower bound, we know that estimator is optimal in the sense that 
no unbiased 6 can estimate 6 with greater precision. 


(Cramér-Rao Inequality.) Let fy (y; 6) be a continuous pdf with continuous first-order 
and second-order derivatives. Also, suppose that the set of y values, where fy (y; 0) £0, 
does not depend on 0. 

Let Y,, Y2,..., Yn, be arandom sample from fy(y; 6), and let6=h(Y,, Yo,..., Yn) 
be any unbiased estimator of 6. Then 


. 27)! 2 . 2i 
vey |e | (ae) i -| ne? BEG) 
00 962 


[A similar statement holds if the n observations come from a discrete pdf, px (k; 8)]. 


Proof See (93). 
Example 5.5.1

Suppose the random variables $X_1, X_2, \ldots, X_n$ denote the number of successes
(0 or 1) in each of $n$ independent trials, where $p = P$(Success occurs at any given trial)
is an unknown parameter. Then

\[
p_{X_i}(k; p) = p^k (1 - p)^{1-k}, \quad k = 0, 1; \quad 0 < p < 1
\]

Let $X = X_1 + X_2 + \cdots + X_n$ = total number of successes and define $\hat{p} = \frac{X}{n}$. Clearly,
$\hat{p}$ is unbiased for $p$ $\left[E(\hat{p}) = E\left(\frac{X}{n}\right) = \frac{E(X)}{n} = \frac{np}{n} = p\right]$. How does $\operatorname{Var}(\hat{p})$ compare with
the Cramér-Rao lower bound for $p_{X_i}(k; p)$?

Note, first, that

\[
\operatorname{Var}(\hat{p}) = \operatorname{Var}\left(\frac{X}{n}\right) = \frac{1}{n^2}\operatorname{Var}(X) = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}
\]

(since $X$ is a binomial random variable). To evaluate, say, the second form of the
Cramér-Rao lower bound, we begin by writing

\[
\ln p_{X_i}(X_i; p) = X_i \ln p + (1 - X_i)\ln(1 - p)
\]

Moreover,

\[
\frac{\partial \ln p_{X_i}(X_i; p)}{\partial p} = \frac{X_i}{p} - \frac{1 - X_i}{1 - p}
\]

and

\[
\frac{\partial^2 \ln p_{X_i}(X_i; p)}{\partial p^2} = -\frac{X_i}{p^2} - \frac{1 - X_i}{(1 - p)^2}
\]

Taking the expected value of the second derivative gives

\[
E\left[\frac{\partial^2 \ln p_{X_i}(X_i; p)}{\partial p^2}\right] = -\frac{p}{p^2} - \frac{1 - p}{(1 - p)^2} = -\frac{1}{p(1 - p)}
\]

The Cramér-Rao lower bound, then, reduces to

\[
\frac{1}{-n\left[-\dfrac{1}{p(1-p)}\right]} = \frac{p(1-p)}{n}
\]

which equals the variance of $\hat{p} = \frac{X}{n}$. It follows that $\frac{X}{n}$ is the preferred statistic for
estimating the binomial parameter $p$: No unbiased estimator can possibly be more
precise.
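The calculation in this example can be reproduced with exact rational arithmetic. The sketch below is a minimal illustration, not a general-purpose routine: the function names are hypothetical, and the expectation of the second derivative is taken by summing over the two possible outcomes $k = 0$ and $k = 1$.

```python
from fractions import Fraction

def bernoulli_crlb(p, n):
    """Cramér-Rao lower bound for Bernoulli(p) data, using the
    second-derivative form of the bound."""
    p = Fraction(p)

    def second_derivative(k):
        # d^2/dp^2 of [k ln p + (1 - k) ln(1 - p)]
        return -Fraction(k) / p**2 - Fraction(1 - k) / (1 - p) ** 2

    # E[second derivative] = sum over k of P(X = k) * second_derivative(k)
    expected = (1 - p) * second_derivative(0) + p * second_derivative(1)
    return 1 / (-n * expected)

def var_p_hat(p, n):
    """Variance of p-hat = X/n when X is binomial(n, p)."""
    p = Fraction(p)
    return p * (1 - p) / n

if __name__ == "__main__":
    p, n = Fraction(3, 10), 20
    print(bernoulli_crlb(p, n), var_p_hat(p, n))
```

For any admissible $p$ and $n$ the two functions return the same fraction, which is the equality established in the example.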


Definition 5.5.1. Let $\Theta$ denote the set of all estimators $\hat{\theta} = h(Y_1, Y_2, \ldots, Y_n)$
that are unbiased for the parameter $\theta$ in the continuous pdf $f_Y(y; \theta)$. We say
that $\hat{\theta}^*$ is a best (or minimum-variance) estimator if $\hat{\theta}^* \in \Theta$ and

\[
\operatorname{Var}(\hat{\theta}^*) \le \operatorname{Var}(\hat{\theta}) \quad \text{for all } \hat{\theta} \in \Theta
\]

[Similar terminology applies if $\Theta$ is the set of all unbiased estimators for the
parameter $\theta$ in a discrete pdf, $p_X(k; \theta)$].


Related to the notion of a best estimator is the concept of efficiency. The connection
is spelled out in Definition 5.5.2 for the case where $\hat{\theta}$ is based on data coming
from a continuous pdf $f_Y(y; \theta)$. The same terminology applies if the data are a set
of $X_i$’s from a discrete pdf $p_X(k; \theta)$.
Definition 5.5.2. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size $n$ drawn from
the continuous pdf $f_Y(y; \theta)$. Let $\hat{\theta} = h(Y_1, Y_2, \ldots, Y_n)$ be an unbiased estimator
for $\theta$.

a. The unbiased estimator $\hat{\theta}$ is said to be efficient if the variance of $\hat{\theta}$ equals
the Cramér-Rao lower bound associated with $f_Y(y; \theta)$.

b. The efficiency of an unbiased estimator $\hat{\theta}$ is the ratio of the Cramér-Rao
lower bound for $f_Y(y; \theta)$ to the variance of $\hat{\theta}$.


Comment The designations “efficient” and “best” are not synonymous. If the variance
of an unbiased estimator is equal to the Cramér-Rao lower bound, then that
estimator by definition is a best estimator. The converse, though, is not always true.
There are situations for which the variances of no unbiased estimators achieve the
Cramér-Rao lower bound. None of those, then, is efficient, but one (or more) could
still be termed best. For the independent trials described in Example 5.5.1, $\hat{p} = \frac{X}{n}$ is
both efficient and best.


Example 5.5.2

If $Y_1, Y_2, \ldots, Y_n$ is a random sample from $f_Y(y; \theta) = 2y/\theta^2$, $0 \le y \le \theta$, $\hat{\theta} = \frac{3}{2}\bar{Y}$ is an
unbiased estimator for $\theta$ (see Example 5.4.2). Show that the variance of $\hat{\theta}$ is less
than the Cramér-Rao lower bound for $f_Y(y; \theta)$.

From Example 5.4.6, we know that

\[
\operatorname{Var}(\hat{\theta}) = \frac{\theta^2}{8n}
\]

To calculate the Cramér-Rao lower bound for $f_Y(y; \theta)$, we first note that

\[
\ln f_Y(Y; \theta) = \ln(2Y\theta^{-2}) = \ln 2Y - 2\ln\theta
\]

and

\[
\frac{\partial \ln f_Y(Y; \theta)}{\partial \theta} = \frac{-2}{\theta}
\]

Therefore,

\[
E\left[\left(\frac{\partial \ln f_Y(Y; \theta)}{\partial \theta}\right)^2\right] = E\left(\frac{4}{\theta^2}\right) = \int_0^\theta \frac{4}{\theta^2}\cdot\frac{2y}{\theta^2}\,dy = \frac{4}{\theta^2}
\]

and

\[
\left\{ n E\left[\left(\frac{\partial \ln f_Y(Y; \theta)}{\partial \theta}\right)^2\right] \right\}^{-1} = \frac{\theta^2}{4n}
\]

Is the variance of $\hat{\theta}$ less than the Cramér-Rao lower bound? Yes, $\frac{\theta^2}{8n} < \frac{\theta^2}{4n}$. Is the
statement of Theorem 5.5.1 contradicted? No, because the theorem does not apply
in this situation: The set of $y$’s where $f_Y(y; \theta) \ne 0$ is a function of $\theta$, a condition that
violates one of the Cramér-Rao assumptions.


Questions


5.5.1. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from $f_Y(y; \theta) = \frac{1}{\theta}e^{-y/\theta}$, $y > 0$. Compare the Cramér-Rao lower
bound for $f_Y(y; \theta)$ to the variance of the maximum likelihood estimator for $\theta$, $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} Y_i$. Is $\bar{Y}$ a best estimator
for $\theta$?

5.5.2. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the Poisson distribution, $p_X(k; \lambda) = \frac{e^{-\lambda}\lambda^k}{k!}$,
$k = 0, 1, \ldots$. Show that $\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is an efficient estimator for $\lambda$.

5.5.3. Suppose a random sample of size $n$ is taken from a normal distribution with mean $\mu$ and variance $\sigma^2$, where
$\sigma^2$ is known. Compare the Cramér-Rao lower bound for $f_Y(y; \mu)$ with the variance of $\hat{\mu} = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$. Is $\bar{Y}$ an
efficient estimator for $\mu$?

5.5.4. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from the uniform pdf $f_Y(y; \theta) = 1/\theta$, $0 \le y \le \theta$. Compare the
Cramér-Rao lower bound for $f_Y(y; \theta)$ with the variance of the unbiased estimator $\hat{\theta} = \frac{n+1}{n} \cdot Y_{\max}$. Discuss.

5.5.5. Let $X$ have the pdf $f_X(k; \theta) = \frac{(\theta - 1)^{k-1}}{\theta^k}$, $k = 1, 2, 3, \ldots$, $\theta > 1$, which is geometric ($p = 1/\theta$). For this
pdf $E(X) = \theta$ and $\operatorname{Var}(X) = \theta(\theta - 1)$ (see Theorem 4.4.1). Is the statistic $X$ efficient?

5.5.6. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size $n$ from the pdf

\[
f_Y(y; \theta) = \frac{1}{(r - 1)!\,\theta^r}\, y^{r-1} e^{-y/\theta}, \quad y > 0
\]

(a) Show that $\hat{\theta} = \frac{1}{r}\bar{Y}$ is an unbiased estimator for $\theta$.

(b) Show that $\hat{\theta} = \frac{1}{r}\bar{Y}$ is a minimum-variance estimator for $\theta$.

5.5.7. Prove the equivalence of the two forms given for the Cramér-Rao lower bound in Theorem 5.5.1. [Hint:
Differentiate the equation $\int_{-\infty}^{\infty} f_Y(y)\,dy = 1$ with respect to $\theta$ and deduce that
$\int_{-\infty}^{\infty} \frac{\partial \ln f_Y(y)}{\partial \theta} f_Y(y)\,dy = 0$. Then differentiate again with respect to $\theta$.]


5.6 Sufficient Estimators 


Statisticians have proven to be quite diligent (and creative) in articulating properties 
that good estimators should exhibit. Sections 5.4 and 5.5, for example, intro- 
duced the notions of an estimator being unbiased and having minimum variance; 
Section 5.7 will explain what it means for an estimator to be “consistent.” All of 
those properties are easy to motivate, and they impose conditions on the probabilistic
behavior of $\hat{\theta}$ that make eminently good sense. In this section, we look at a
deeper property of estimators, one that is not so intuitive but has some particularly 
important theoretical implications. 

Whether or not an estimator is sufficient refers to the amount of “information” 
it contains about the unknown parameter. Estimates, of course, are calculated using 
values obtained from random samples [drawn from either $p_X(k; \theta)$ or $f_Y(y; \theta)$]. If
everything that we can possibly know from the data about $\theta$ is encapsulated in the
estimate $\theta_e$, then the corresponding estimator $\hat{\theta}$ is said to be sufficient. A comparison
of two estimators, one sufficient and the other not, should help clarify the concept. 


An Estimator That Is Sufficient 


Suppose that a random sample of size $n$, $X_1 = k_1, X_2 = k_2, \ldots, X_n = k_n$, is taken
from the Bernoulli pdf,

\[
p_X(k; p) = p^k (1 - p)^{1-k}, \quad k = 0, 1
\]

where $p$ is an unknown parameter. We know from Example 5.1.1 that the maximum
likelihood estimator for $p$ is

\[
\hat{p} = \left(\frac{1}{n}\right)\sum_{i=1}^{n} X_i
\]

[and the maximum likelihood estimate is $p_e = \left(\frac{1}{n}\right)\sum_{i=1}^{n} k_i$]. To show that $\hat{p}$ is a sufficient
estimator for $p$ requires that we calculate the conditional probability that
$X_1 = k_1, \ldots, X_n = k_n$ given that $\hat{p} = p_e$.

Generalizing the Comment following Example 3.11.3, we can write

\[
P(X_1 = k_1, \ldots, X_n = k_n \mid \hat{p} = p_e) = \frac{P(X_1 = k_1, \ldots, X_n = k_n \cap \hat{p} = p_e)}{P(\hat{p} = p_e)} = \frac{P(X_1 = k_1, \ldots, X_n = k_n)}{P(\hat{p} = p_e)}
\]

But

\[
P(X_1 = k_1, \ldots, X_n = k_n) = p^{k_1}(1 - p)^{1-k_1} \cdots p^{k_n}(1 - p)^{1-k_n}
= p^{\sum_{i=1}^{n} k_i}(1 - p)^{n - \sum_{i=1}^{n} k_i}
= p^{np_e}(1 - p)^{n - np_e}
\]

and

\[
P(\hat{p} = p_e) = P\left(\sum_{i=1}^{n} X_i = np_e\right) = \binom{n}{np_e} p^{np_e}(1 - p)^{n - np_e}
\]

since $\sum_{i=1}^{n} X_i$ has a binomial distribution with parameters $n$ and $p$ (recall Example
3.9.3). Therefore,

\[
P(X_1 = k_1, \ldots, X_n = k_n \mid \hat{p} = p_e) = \frac{p^{np_e}(1 - p)^{n - np_e}}{\binom{n}{np_e} p^{np_e}(1 - p)^{n - np_e}} = \frac{1}{\binom{n}{np_e}} \tag{5.6.1}
\]

Notice that $P(X_1 = k_1, \ldots, X_n = k_n \mid \hat{p} = p_e)$ is not a function of $p$. That is precisely
the condition that makes $\hat{p} = \left(\frac{1}{n}\right)\sum_{i=1}^{n} X_i$ a sufficient estimator. Equation 5.6.1
says, in effect, that everything the data can possibly tell us about the parameter
$p$ is contained in the estimate $p_e$. Remember that, initially, the joint pdf of the
sample, $P(X_1 = k_1, \ldots, X_n = k_n)$, is a function of the $k_i$’s and $p$. What we have just
shown, though, is that if that probability is conditioned on the value of this particular
estimate (that is, on $\hat{p} = p_e$), then $p$ is eliminated and the probability of the
sample is completely determined [in this case, it equals $\binom{n}{np_e}^{-1}$, where $\binom{n}{np_e}$ is the
number of ways to arrange the 0’s and 1’s in a sample of size $n$ for which $\hat{p} = p_e$].

If we had used some other estimator, say, $\hat{p}^*$, and if $P(X_1 = k_1, \ldots, X_n = k_n \mid \hat{p}^* = p_e^*)$
had remained a function of $p$, the conclusion would be that the information
in $p_e^*$ was not “sufficient” to eliminate the parameter $p$ from the conditional probability.
A simple example of such a $\hat{p}^*$ would be $\hat{p}^* = X_1$. Then $p_e^*$ would be $k_1$ and
the conditional probability of $X_1 = k_1, \ldots, X_n = k_n$ given that $\hat{p}^* = p_e^*$ would remain
a function of $p$:

\[
P(X_1 = k_1, \ldots, X_n = k_n \mid \hat{p}^* = k_1) = \frac{p^{\sum_{i=1}^{n} k_i}(1 - p)^{n - \sum_{i=1}^{n} k_i}}{p^{k_1}(1 - p)^{1-k_1}}
= p^{\sum_{i=2}^{n} k_i}(1 - p)^{n - 1 - \sum_{i=2}^{n} k_i}
\]
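The contrast between conditioning on $\hat{p}$ and conditioning on $X_1$ can be verified by brute-force enumeration for a small sample. The following is a sketch under the Bernoulli model above; the function names are hypothetical, and the marginal probability of the sum is obtained by adding the joint probabilities of all 0/1 samples with that sum rather than by quoting the binomial formula.

```python
from fractions import Fraction
from itertools import product
from math import comb

def conditional_prob_given_sum(sample, p):
    """P(X_1=k_1,...,X_n=k_n | sum X_i = sum k_i) for Bernoulli(p) trials."""
    p = Fraction(p)
    n, s = len(sample), sum(sample)
    joint = p**s * (1 - p) ** (n - s)
    # P(sum = s): add the joint probability of every 0/1 sample with that sum
    marginal = sum(
        p ** sum(ks) * (1 - p) ** (n - sum(ks))
        for ks in product((0, 1), repeat=n)
        if sum(ks) == s
    )
    return joint / marginal

def conditional_prob_given_first(sample, p):
    """P(X_1=k_1,...,X_n=k_n | X_1 = k_1): conditioning on X_1 alone."""
    p = Fraction(p)
    n, s, k1 = len(sample), sum(sample), sample[0]
    joint = p**s * (1 - p) ** (n - s)
    marginal = p**k1 * (1 - p) ** (1 - k1)
    return joint / marginal

if __name__ == "__main__":
    sample = (1, 0, 1, 1, 0)
    for p in (Fraction(1, 4), Fraction(2, 3)):
        print(p, conditional_prob_given_sum(sample, p),
              conditional_prob_given_first(sample, p))
    print("1 / C(5, 3) =", Fraction(1, comb(5, 3)))
```

The first conditional probability is the same for every $p$ and matches Equation 5.6.1, while the second changes with $p$.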
Comment Some of the dice problems we did in Section 2.4 have aspects that parallel
to some extent the notion of an estimator being sufficient. Suppose, for example,
we roll a pair of fair dice without being allowed to view the outcome. Our objective
is to calculate the probability that the sum showing is an even number. If we had no
other information, the answer would be $\frac{1}{2}$. Suppose, though, that two people do see
the outcome (which was, in fact, a sum of 7) and each is allowed to characterize
the outcome without providing us with the exact sum that occurred. Person A tells
us that “the sum was less than or equal to 7”; Person B says that “the sum was an
odd number.”

Whose information is more helpful? Person B’s. The conditional probability of
the sum being even given that the sum is less than or equal to 7 is $\frac{9}{21}$, which still
leaves our initial question largely unanswered:

\[
P(\text{Sum is even} \mid \text{sum} \le 7) = \frac{P(2) + P(4) + P(6)}{P(2) + P(3) + P(4) + P(5) + P(6) + P(7)}
= \frac{\frac{1}{36} + \frac{3}{36} + \frac{5}{36}}{\frac{1}{36} + \frac{2}{36} + \frac{3}{36} + \frac{4}{36} + \frac{5}{36} + \frac{6}{36}}
= \frac{9}{21}
\]

In contrast, Person B utilized the data in a way that definitely answered the original
question:

\[
P(\text{Sum is even} \mid \text{Sum is odd}) = 0
\]

In a sense, B’s information was “sufficient”; A’s information was not.
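The three probabilities in this Comment follow from enumerating the 36 equally likely outcomes of two fair dice. A short sketch, with the helper `conditional` introduced here purely for illustration:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely sums from rolling two fair dice
sums = [a + b for a, b in product(range(1, 7), repeat=2)]

def conditional(event, given):
    """P(event | given), both expressed as predicates on the sum."""
    given_outcomes = [s for s in sums if given(s)]
    hits = [s for s in given_outcomes if event(s)]
    return Fraction(len(hits), len(given_outcomes))

def is_even(s):
    return s % 2 == 0

p_even = conditional(is_even, lambda s: True)              # no extra information
p_even_given_a = conditional(is_even, lambda s: s <= 7)    # Person A's statement
p_even_given_b = conditional(is_even, lambda s: s % 2 == 1)  # Person B's statement

if __name__ == "__main__":
    print(p_even, p_even_given_a, p_even_given_b)
```

Note that `Fraction` reduces 9/21 to 3/7, so the printed value is the reduced form of the probability quoted in the text.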


An Estimator That Is Not Sufficient 


Suppose a random sample of size $n$, $Y_1, Y_2, \ldots, Y_n$, is drawn from the pdf
$f_Y(y; \theta) = 2y/\theta^2$, $0 \le y \le \theta$, where $\theta$ is an unknown parameter. Recall that the method
of moments estimator is

\[
\hat{\theta} = \frac{3}{2}\bar{Y} = \frac{3}{2n}\sum_{i=1}^{n} Y_i
\]

This statistic is not sufficient because all the information in the data that pertains to
the parameter $\theta$ is not necessarily contained in the numerical value $\theta_e$.

If $\hat{\theta}$ were a sufficient statistic, then any two random samples of size $n$ having
the same value for $\theta_e$ should yield exactly the same information about $\theta$. However,
a simple numerical example shows this not to be the case. Consider two
random samples of size 3: $y_1 = 3, y_2 = 4, y_3 = 5$ and $y_1 = 1, y_2 = 3, y_3 = 8$. In both
cases,

\[
\theta_e = \frac{3}{2}\bar{y} = \frac{3}{2}\cdot\frac{1}{3}\sum_{i=1}^{3} y_i = 6
\]

Do both samples, though, convey the same information about the possible value of
$\theta$? No. Based on the first sample, the true $\theta$ could, in fact, be equal to 7. On the
other hand, the second sample rules out the possibility that $\theta$ is 7 because one of the
observations ($y_3 = 8$) is larger than 7, but according to the definition of the pdf, all
$Y_i$’s must be less than $\theta$.
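One way to see the difference between the two samples numerically is to evaluate the likelihood at a candidate $\theta$ that lies between the two sample maxima. The sketch below assumes the pdf $f_Y(y; \theta) = 2y/\theta^2$ on $0 \le y \le \theta$; the function names and the choice of 7 as the candidate value are illustrative.

```python
def likelihood(sample, theta):
    """L(theta) for f_Y(y; theta) = 2y / theta^2 on 0 <= y <= theta, zero otherwise."""
    if any(y < 0 or y > theta for y in sample):
        return 0.0
    value = 1.0
    for y in sample:
        value *= 2 * y / theta**2
    return value

def moment_estimate(sample):
    """Method of moments estimate (3/2) * ybar for this pdf."""
    return 1.5 * sum(sample) / len(sample)

sample_a = [3, 4, 5]
sample_b = [1, 3, 8]

if __name__ == "__main__":
    print(moment_estimate(sample_a), moment_estimate(sample_b))
    print(likelihood(sample_a, 7), likelihood(sample_b, 7))
```

Both samples give the same estimate, yet the likelihood at the candidate value is positive for the first sample and zero for the second, so the estimate alone does not carry all the information in the data.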
A Formal Definition


Suppose that $X_1 = k_1, \ldots, X_n = k_n$ is a random sample of size $n$ from the discrete pdf
$p_X(k; \theta)$, where $\theta$ is an unknown parameter. Conceptually, $\hat{\theta}$ is a sufficient statistic
for $\theta$ if

\[
P(X_1 = k_1, \ldots, X_n = k_n \mid \hat{\theta} = \theta_e)
= \frac{P(X_1 = k_1, \ldots, X_n = k_n \cap \hat{\theta} = \theta_e)}{P(\hat{\theta} = \theta_e)}
= \frac{\prod_{i=1}^{n} p_X(k_i; \theta)}{p_{\hat{\theta}}(\theta_e; \theta)}
= b(k_1, \ldots, k_n) \tag{5.6.2}
\]

where $p_{\hat{\theta}}(\theta_e; \theta)$ is the pdf of the statistic evaluated at the point $\hat{\theta} = \theta_e$ and
$b(k_1, \ldots, k_n)$ is a constant independent of $\theta$. Equivalently, the condition that qualifies
a statistic as being sufficient can be expressed by cross-multiplying Equation 5.6.2.


Definition 5.6.1. Let $X_1 = k_1, \ldots, X_n = k_n$ be a random sample of size $n$ from
$p_X(k; \theta)$. The statistic $\hat{\theta} = h(X_1, \ldots, X_n)$ is sufficient for $\theta$ if the likelihood function,
$L(\theta)$, factors into the product of the pdf for $\hat{\theta}$ and a constant that does not
involve $\theta$; that is, if

\[
L(\theta) = \prod_{i=1}^{n} p_X(k_i; \theta) = p_{\hat{\theta}}(\theta_e; \theta)\, b(k_1, \ldots, k_n)
\]

A similar statement holds if the data consist of a random sample $Y_1 = y_1, \ldots, Y_n = y_n$
drawn from a continuous pdf $f_Y(y; \theta)$.


Comment If $\hat{\theta}$ is sufficient for $\theta$, then any one-to-one function of $\hat{\theta}$ is also a sufficient
statistic for $\theta$. As a case in point, we showed on p. 324 that

\[
\hat{p} = \left(\frac{1}{n}\right)\sum_{i=1}^{n} X_i
\]

is a sufficient statistic for the parameter $p$ in a Bernoulli pdf. It is also true, then, that

\[
\hat{p}^* = n\hat{p} = \sum_{i=1}^{n} X_i
\]

is sufficient for $p$.


Example 5.6.1

Let $X_1 = k_1, \ldots, X_n = k_n$ be a random sample of size $n$ from the Poisson pdf,
$p_X(k; \lambda) = e^{-\lambda}\lambda^k/k!$, $k = 0, 1, 2, \ldots$. Show that

\[
\hat{\lambda} = \sum_{i=1}^{n} X_i
\]

is a sufficient statistic for $\lambda$.

From Example 3.12.10, we know that $\hat{\lambda}$, being a sum of $n$ independent Poisson
random variables, each with parameter $\lambda$, is itself a Poisson random variable with


parameter $n\lambda$. By Definition 5.6.1, then, $\hat{\lambda}$ is a sufficient statistic for $\lambda$ if the sample’s
likelihood function factors into a product of the pdf for $\hat{\lambda}$ times a constant that is
independent of $\lambda$.

But

\[
L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{k_i}}{k_i!}
= \frac{e^{-n\lambda}\lambda^{\sum_{i=1}^{n} k_i}}{\prod_{i=1}^{n} k_i!}
= \frac{e^{-n\lambda}(n\lambda)^{\sum_{i=1}^{n} k_i}}{\left(\sum_{i=1}^{n} k_i\right)!} \cdot \frac{\left(\sum_{i=1}^{n} k_i\right)!}{n^{\sum_{i=1}^{n} k_i}\prod_{i=1}^{n} k_i!}
= p_{\hat{\lambda}}(\lambda_e; \lambda) \cdot b(k_1, \ldots, k_n) \tag{5.6.3}
\]

proving that $\hat{\lambda} = \sum_{i=1}^{n} X_i$ is a sufficient statistic for $\lambda$.


Comment The factorization in Equation 5.6.3 implies that $\hat{\lambda} = \sum_{i=1}^{n} X_i$ is a sufficient
statistic for $\lambda$. It is not, however, an unbiased estimator for $\lambda$:

\[
E(\hat{\lambda}) = \sum_{i=1}^{n} E(X_i) = \sum_{i=1}^{n} \lambda = n\lambda
\]

Constructing an unbiased estimator based on the sufficient statistic, though, is a
simple matter. Let

\[
\hat{\lambda}^* = \frac{1}{n}\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} X_i
\]

Then $E(\hat{\lambda}^*) = \frac{1}{n}E(\hat{\lambda}) = \frac{1}{n}\,n\lambda = \lambda$, so $\hat{\lambda}^*$ is unbiased for $\lambda$. Moreover, $\hat{\lambda}^*$ is a one-to-one
function of $\hat{\lambda}$, so, by the Comment on p. 326, $\hat{\lambda}^*$ is, itself, a sufficient estimator
for $\lambda$.


A Second Factorization Criterion


Using Definition 5.6.1 to verify that a statistic is sufficient requires that the pdf
$p_{\hat{\theta}}[h(k_1, \ldots, k_n); \theta]$ or $f_{\hat{\theta}}[h(y_1, \ldots, y_n); \theta]$ be explicitly identified as one of the two
factors whose product equals the likelihood function. If $\hat{\theta}$ is complicated, though,
finding its pdf may be prohibitively difficult. The next theorem gives an alternative
factorization criterion for establishing that a statistic is sufficient. It does not require
that the pdf for $\hat{\theta}$ be known.


Theorem 5.6.1

Let $X_1 = k_1, \ldots, X_n = k_n$ be a random sample of size $n$ from the discrete pdf $p_X(k; \theta)$.
The statistic $\hat{\theta} = h(X_1, \ldots, X_n)$ is sufficient for $\theta$ if and only if there are functions
$g[h(k_1, \ldots, k_n); \theta]$ and $b(k_1, \ldots, k_n)$ such that
\[
L(\theta) = g[h(k_1, \ldots, k_n); \theta] \cdot b(k_1, \ldots, k_n) \tag{5.6.4}
\]

where the function $b(k_1, \ldots, k_n)$ does not involve the parameter $\theta$. A similar statement
holds in the continuous case.


Proof First, suppose that $\hat{\theta}$ is sufficient for $\theta$. Then the factorization criterion of
Definition 5.6.1 includes Equation 5.6.4 as a special case.

Now, assume that Equation 5.6.4 holds. The theorem will be proved if it can
be shown that $g[h(k_1, \ldots, k_n); \theta]$ can always be “converted” to include the pdf of $\hat{\theta}$
(at which point Definition 5.6.1 would apply). Let $c$ be some value of the function
$h(k_1, \ldots, k_n)$ and let $A$ be the set of samples of size $n$ that constitute the inverse image
of $c$; that is, $A = h^{-1}(c)$. Then

\[
p_{\hat{\theta}}(c; \theta) = \sum_{(k_1, k_2, \ldots, k_n) \in A} p_{X_1, X_2, \ldots, X_n}(k_1, k_2, \ldots, k_n)
= \sum_{(k_1, k_2, \ldots, k_n) \in A} \prod_{i=1}^{n} p_{X_i}(k_i)
\]

\[
= \sum_{(k_1, k_2, \ldots, k_n) \in A} g(c; \theta) \cdot b(k_1, k_2, \ldots, k_n)
= g(c; \theta) \cdot \sum_{(k_1, k_2, \ldots, k_n) \in A} b(k_1, k_2, \ldots, k_n)
\]

Since we are interested only in points where $p_{\hat{\theta}}(c; \theta) \ne 0$, we can assume that
$\sum_{(k_1, k_2, \ldots, k_n) \in A} b(k_1, k_2, \ldots, k_n) \ne 0$. Therefore,

\[
g(c; \theta) = p_{\hat{\theta}}(c; \theta) \cdot \frac{1}{\sum_{(k_1, k_2, \ldots, k_n) \in A} b(k_1, k_2, \ldots, k_n)} \tag{5.6.5}
\]

Substituting the right-hand side of Equation 5.6.5 into Equation 5.6.4 shows that
$\hat{\theta}$ qualifies as a sufficient statistic for $\theta$. A similar argument can be made if the
data consist of a random sample $Y_1 = y_1, \ldots, Y_n = y_n$ drawn from a continuous pdf
$f_Y(y; \theta)$. See (200) for more details.


Example 5.6.2

Suppose $Y_1, \ldots, Y_n$ is a random sample from $f_Y(y; \theta) = \frac{2y}{\theta^2}$, $0 \le y \le \theta$. We know from
Question 5.2.12 that the maximum likelihood estimator for $\theta$ is $\hat{\theta} = Y_{\max}$. Is $Y_{\max}$ also
sufficient for $\theta$?

Since the set of $Y$ values where $f_Y(y; \theta) \ne 0$ depends on $\theta$, the likelihood function
must be written in a way to include that restriction. The device achieving that
goal is called an indicator function. We define the function $I_{[0,\theta]}(y)$ by

\[
I_{[0,\theta]}(y) = \begin{cases} 1 & 0 \le y \le \theta \\ 0 & \text{otherwise} \end{cases}
\]

Then we can write $f_Y(y; \theta) = \frac{2y}{\theta^2} \cdot I_{[0,\theta]}(y)$ for all $y$.

The likelihood function is

\[
L(\theta) = \prod_{i=1}^{n} \frac{2y_i}{\theta^2} \cdot I_{[0,\theta]}(y_i) = \left(\prod_{i=1}^{n} 2y_i\right)\left(\frac{1}{\theta^2}\right)^n \prod_{i=1}^{n} I_{[0,\theta]}(y_i)
\]

But the critical fact is that

\[
\prod_{i=1}^{n} I_{[0,\theta]}(y_i) = I_{[0,\theta]}(y_{\max})
\]
Thus the likelihood function decomposes in such a way that the factor involving $\theta$
contains only the $y_i$’s through $y_{\max}$:

\[
L(\theta) = \left(\prod_{i=1}^{n} 2y_i\right)\left(\frac{1}{\theta^2}\right)^n \prod_{i=1}^{n} I_{[0,\theta]}(y_i)
= \left[\left(\frac{1}{\theta^2}\right)^n I_{[0,\theta]}(y_{\max})\right] \cdot \left(\prod_{i=1}^{n} 2y_i\right)
\]

This decomposition meets the criterion of Theorem 5.6.1, and $Y_{\max}$ is sufficient for $\theta$.
(Why doesn’t this argument work for $Y_{\min}$?)


Sufficiency as It Relates to Other Properties of Estimators


This chapter has constructed a rather elaborate facade of mathematical properties
and procedures associated with estimators. We have asked whether $\hat{\theta}$ is unbiased,
efficient, and/or sufficient. How we find $\hat{\theta}$ has also come under scrutiny: some estimators
have been derived using the method of maximum likelihood; others have
come from the method of moments. Not all of these aspects of estimators and
estimation, though, are entirely disjoint; some are related and interconnected in
a variety of ways.

Suppose, for example, that a sufficient estimator $\hat{\theta}_S$ exists for a parameter $\theta$, and
suppose that $\hat{\theta}_M$ is the maximum likelihood estimator for that same $\theta$. If, for a given
sample, $\hat{\theta}_S = \theta_e$, we know from Theorem 5.6.1 that

\[
L(\theta) = g(\theta_e; \theta) \cdot b(k_1, \ldots, k_n)
\]

Since the maximum likelihood estimate, by definition, maximizes $L(\theta)$, it must also
maximize $g(\theta_e; \theta)$. But any $\theta$ that maximizes $g(\theta_e; \theta)$ will necessarily be a function of
$\theta_e$. It follows, then, that maximum likelihood estimators are necessarily functions of
sufficient estimators; that is, $\hat{\theta}_M = f(\hat{\theta}_S)$ (which is the primary theoretical justification
for why maximum likelihood estimators are preferred to method of moments
estimators).

Sufficient estimators also play a critical role in the search for efficient
estimators, that is, unbiased estimators whose variance equals the Cramér-Rao
lower bound. There will be an infinite number of unbiased estimators for any
unknown parameter in any pdf. That said, there may be a subset of those unbiased
estimators that are functions of sufficient estimators. If so, it can be proved [see
(93)] that the variance of every unbiased estimator based on a sufficient estimator
will necessarily be less than the variance of every unbiased estimator that is not a
function of a sufficient estimator. It follows, then, that to find an efficient estimator
for $\theta$, we can restrict our attention to functions of sufficient estimators for $\theta$.


Questions


5.6.1. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the geometric distribution,
$p_X(k; p) = (1 - p)^{k-1}p$, $k = 1, 2, \ldots$. Show that $\hat{p} = \sum_{i=1}^{n} X_i$ is sufficient for $p$.


5.6.2. Let $X_1$, $X_2$, and $X_3$ be a set of three independent Bernoulli random variables with unknown parameter
$p = P(X_i = 1)$. It was shown on p. 324 that $\hat{p} = X_1 + X_2 + X_3$ is sufficient for $p$. Show that the linear combination
$\hat{p}^* = X_1 + 2X_2 + 3X_3$ is not sufficient for $p$.


5.6.3. If $\hat{\theta}$ is sufficient for $\theta$, show that any one-to-one function of $\hat{\theta}$ is also sufficient for $\theta$.


5.6.4. Show that $\hat{\sigma}^2 = \sum_{i=1}^{n} Y_i^2$ is sufficient for $\sigma^2$ if $Y_1, Y_2, \ldots, Y_n$ is a random sample from a normal pdf with
$\mu = 0$.


5.6.5. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size $n$ from the pdf of Question 5.5.6,

\[
f_Y(y; \theta) = \frac{1}{(r - 1)!\,\theta^r}\, y^{r-1} e^{-y/\theta}, \quad 0 \le y
\]

for positive parameter $\theta$ and $r$ a known positive integer. Find a sufficient statistic for $\theta$.


5.6.6. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size $n$ from the pdf $f_Y(y; \theta) = \theta y^{\theta - 1}$, $0 \le y \le 1$. Use Theorem 5.6.1 to
show that $W = \prod_{i=1}^{n} Y_i$ is a sufficient statistic for $\theta$. Is the maximum likelihood estimator of $\theta$ a function of $W$?


5.6.7. Suppose a random sample of size $n$ is drawn from the pdf

\[
f_Y(y; \theta) = e^{-(y - \theta)}, \quad \theta \le y
\]

(a) Show that $\hat{\theta} = Y_{\min}$ is sufficient for the threshold parameter $\theta$.

(b) Show that $Y_{\max}$ is not sufficient for $\theta$.


5.6.8. Suppose a random sample of size $n$ is drawn from the pdf

\[
f_Y(y; \theta) = \frac{1}{\theta}, \quad 0 \le y \le \theta
\]

Find a sufficient statistic for $\theta$.


5.6.9. A probability model $g_W(w; \theta)$ is said to be expressed in exponential form if it can be written as

\[
g_W(w; \theta) = e^{K(w)p(\theta) + S(w) + q(\theta)}
\]

where the range of $W$ is independent of $\theta$. Show that $\hat{\theta} = \sum_{i=1}^{n} K(W_i)$ is sufficient for $\theta$.


5.6.10. Write the pdf $f_Y(y; \lambda) = \lambda e^{-\lambda y}$, $y > 0$, in exponential form and deduce a sufficient statistic for $\lambda$ (see
Question 5.6.9). Assume that the data consist of a random sample of size $n$.


5.6.11. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a Pareto pdf,

\[
f_Y(y; \theta) = \theta/(1 + y)^{\theta + 1}, \quad 0 \le y < \infty; \quad 0 < \theta < \infty
\]

Write $f_Y(y; \theta)$ in exponential form and deduce a sufficient statistic for $\theta$ (see Question 5.6.9).


5.7 Consistency


The properties of estimators that we have examined thus far (for instance, unbiasedness
and sufficiency) have assumed that the data consist of a fixed sample size
$n$. It sometimes makes sense, though, to consider the asymptotic behavior of estimators:
We may find, for example, that an estimator possesses a desirable property in
the limit that it fails to exhibit for any finite $n$.

Recall Example 5.4.4, which focused on the maximum likelihood estimator for
$\sigma^2$ in a sample of size $n$ drawn from a normal pdf [that is, on $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$].
For any finite $n$, $\hat{\sigma}^2$ is biased:

\[
E\left[\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2\right] = \frac{n - 1}{n}\,\sigma^2 \ne \sigma^2
\]

As $n$ goes to infinity, though, the limit of $E(\hat{\sigma}^2)$ does equal $\sigma^2$, and we say that $\hat{\sigma}^2$ is
asymptotically unbiased.


Introduced in this section is a second asymptotic characteristic of an estimator,
a property known as consistency. Unlike asymptotic unbiasedness, consistency
refers to the shape of the pdf for $\hat{\theta}_n$ and how that shape changes as a function of $n$.
(To emphasize the fact that the estimator for a parameter is now being viewed as a
sequence of estimators, we will write $\hat{\theta}_n$ instead of $\hat{\theta}$.)


Definition 5.7.1. An estimator $\hat{\theta}_n = h(W_1, W_2, \ldots, W_n)$ is said to be consistent
for $\theta$ if it converges in probability to $\theta$; that is, if for all $\varepsilon > 0$,

\[
\lim_{n \to \infty} P(|\hat{\theta}_n - \theta| < \varepsilon) = 1
\]

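Comment To solve certain kinds of sample-size problems, it can be helpful to think
of Definition 5.7.1 in an epsilon/delta context; that is, $\hat{\theta}_n$ is consistent for $\theta$ if for all
$\varepsilon > 0$ and $\delta > 0$, there exists an $n(\varepsilon, \delta)$ such that

\[
P(|\hat{\theta}_n - \theta| < \varepsilon) > 1 - \delta \quad \text{for } n > n(\varepsilon, \delta)
\]


Example 5.7.1

Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from the uniform pdf

\[
f_Y(y; \theta) = \frac{1}{\theta}, \quad 0 \le y \le \theta
\]

and let $\hat{\theta}_n = Y_{\max}$. We already know that $Y_{\max}$ is biased for $\theta$, but is it consistent?
Recall from Question 5.4.2 that

\[
f_{Y_{\max}}(y) = \frac{n y^{n-1}}{\theta^n}, \quad 0 \le y \le \theta
\]

Therefore,

\[
P(|\hat{\theta}_n - \theta| < \varepsilon) = P(\theta - \varepsilon < \hat{\theta}_n < \theta) = \int_{\theta - \varepsilon}^{\theta} \frac{n y^{n-1}}{\theta^n}\,dy = \frac{y^n}{\theta^n}\bigg|_{\theta - \varepsilon}^{\theta}
= 1 - \left(\frac{\theta - \varepsilon}{\theta}\right)^n
\]

Since $[(\theta - \varepsilon)/\theta] < 1$, it follows that $[(\theta - \varepsilon)/\theta]^n \to 0$ as $n \to \infty$. Therefore,
$\lim_{n \to \infty} P(|\hat{\theta}_n - \theta| < \varepsilon) = 1$, proving that $\hat{\theta}_n = Y_{\max}$ is consistent for $\theta$.

Figure 5.7.1 illustrates the convergence of $\hat{\theta}_n$. As $n$ increases, the shape of $f_{Y_{\max}}(y)$
changes in such a way that the pdf becomes increasingly concentrated in an
$\varepsilon$-neighborhood of $\theta$. For any $n > n(\varepsilon, \delta)$, $P(|\hat{\theta}_n - \theta| < \varepsilon) > 1 - \delta$.

[Figure 5.7.1: horizontal axis $n$, marked at 0, 1, 2, 3, ..., $n(\varepsilon, \delta)$; regions labeled $P(|\hat{\theta}_n - \theta| < \varepsilon) < 1 - \delta$ and $P(|\hat{\theta}_n - \theta| < \varepsilon) > 1 - \delta$]

If $\theta$, $\varepsilon$, and $\delta$ are specified, we can calculate $n(\varepsilon, \delta)$, the smallest sample size that
will enable $\hat{\theta}_n$ to achieve a given precision. For example, suppose $\theta = 4$. How large a
sample is required to give $\hat{\theta}_n$ an 80% chance of lying within 0.10 of $\theta$?

In the terminology of the Comment on p. 331, $\varepsilon = 0.10$, $\delta = 0.20$, and

\[
P(|\hat{\theta}_n - 4| < 0.10) = 1 - \left(\frac{4 - 0.10}{4}\right)^n \ge 1 - 0.20
\]

Therefore,

\[
(0.975)^{n(\varepsilon, \delta)} = 0.20
\]

which implies that $n(\varepsilon, \delta) = 64$.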
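The sample-size calculation at the end of this example can be automated. The following sketch solves $1 - [(\theta - \varepsilon)/\theta]^n \ge 1 - \delta$ for the smallest integer $n$; the function name is hypothetical, and the two short loops are only a guard against floating-point rounding in the logarithm step.

```python
import math

def smallest_n_uniform_max(theta, eps, delta):
    """Smallest n with 1 - ((theta - eps) / theta)**n >= 1 - delta,
    i.e. the n(eps, delta) for theta-hat_n = Ymax in the uniform case."""
    ratio = (theta - eps) / theta
    n = math.ceil(math.log(delta) / math.log(ratio))
    # Guard against floating-point edge cases around the boundary
    while ratio**n > delta:
        n += 1
    while n > 1 and ratio ** (n - 1) <= delta:
        n -= 1
    return n

if __name__ == "__main__":
    print(smallest_n_uniform_max(theta=4, eps=0.10, delta=0.20))
```

Tightening either $\varepsilon$ or $\delta$ increases the required $n$, which is the behavior the epsilon/delta form of Definition 5.7.1 describes.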
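A useful result for establishing consistency is Chebyshev’s inequality, which
appears here as Theorem 5.7.1. More generally, the latter serves as an upper bound
for the probability that any random variable lies outside an $\varepsilon$-neighborhood of its
mean.


Theorem 5.7.1

(Chebyshev’s inequality.) Let $W$ be any random variable with mean $\mu$ and variance
$\sigma^2$. For any $\varepsilon > 0$,

\[
P(|W - \mu| < \varepsilon) \ge 1 - \frac{\sigma^2}{\varepsilon^2}
\]

or, equivalently,

\[
P(|W - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}
\]


Proof In the continuous case,

\[
\operatorname{Var}(Y) = \int_{-\infty}^{\infty} (y - \mu)^2 f_Y(y)\,dy
= \int_{-\infty}^{\mu - \varepsilon} (y - \mu)^2 f_Y(y)\,dy + \int_{\mu - \varepsilon}^{\mu + \varepsilon} (y - \mu)^2 f_Y(y)\,dy + \int_{\mu + \varepsilon}^{\infty} (y - \mu)^2 f_Y(y)\,dy
\]

Omitting the nonnegative middle integral gives an inequality:

\[
\operatorname{Var}(Y) \ge \int_{-\infty}^{\mu - \varepsilon} (y - \mu)^2 f_Y(y)\,dy + \int_{\mu + \varepsilon}^{\infty} (y - \mu)^2 f_Y(y)\,dy
\]

\[
\ge \int_{|y - \mu| \ge \varepsilon} (y - \mu)^2 f_Y(y)\,dy
\ge \int_{|y - \mu| \ge \varepsilon} \varepsilon^2 f_Y(y)\,dy
= \varepsilon^2 P(|Y - \mu| \ge \varepsilon)
\]

Division by $\varepsilon^2$ completes the proof. (If the random variable is discrete, replace the
integrals with summations.)


Example 5.7.2

Suppose that $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from a discrete pdf
$p_X(k; \mu)$, where $E(X) = \mu$ and $\operatorname{Var}(X) = \sigma^2 < \infty$. Let $\hat{\mu}_n = \left(\frac{1}{n}\right)\sum_{i=1}^{n} X_i$. Is $\hat{\mu}_n$ a
consistent estimator for $\mu$?

According to Chebyshev’s inequality,

\[
P(|\hat{\mu}_n - \mu| < \varepsilon) > 1 - \frac{\operatorname{Var}(\hat{\mu}_n)}{\varepsilon^2}
\]

But $\operatorname{Var}(\hat{\mu}_n) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = (1/n^2) \cdot n\sigma^2 = \sigma^2/n$, so

\[
P(|\hat{\mu}_n - \mu| < \varepsilon) > 1 - \frac{\sigma^2}{n\varepsilon^2}
\]

For any $\varepsilon$, $\delta$, and $\sigma^2$, an $n$ can be found that makes $\frac{\sigma^2}{n\varepsilon^2} < \delta$. Therefore,
$\lim_{n \to \infty} P(|\hat{\mu}_n - \mu| < \varepsilon) = 1$ (i.e., $\hat{\mu}_n$ is consistent for $\mu$).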
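The last step of this example, finding an $n$ that makes $\sigma^2/(n\varepsilon^2) < \delta$, can be sketched in a few lines. The function names below are hypothetical; inputs are converted through `Fraction(str(x))` so that decimal values such as 0.1 are treated exactly. The resulting $n$ is only the sample size guaranteed by Chebyshev’s inequality, which is typically conservative compared with a calculation based on a specific distribution.

```python
import math
from fractions import Fraction

def chebyshev_sample_size(sigma2, eps, delta):
    """Smallest integer n with sigma2 / (n * eps^2) < delta."""
    sigma2, eps, delta = (Fraction(str(x)) for x in (sigma2, eps, delta))
    bound = sigma2 / (delta * eps**2)   # need n strictly greater than this
    return math.floor(bound) + 1

def chebyshev_lower_bound(sigma2, n, eps):
    """Chebyshev lower bound 1 - sigma2 / (n * eps^2) on P(|mu-hat_n - mu| < eps)."""
    sigma2, eps = Fraction(str(sigma2)), Fraction(str(eps))
    return 1 - sigma2 / (n * eps**2)

if __name__ == "__main__":
    n = chebyshev_sample_size(sigma2=25, eps=2, delta=0.1)
    print(n, float(chebyshev_lower_bound(25, n, 2)))
```

Here the illustrative inputs ($\sigma^2 = 25$, $\varepsilon = 2$, $\delta = 0.1$) are assumptions chosen for the sketch, not values taken from the example.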


Comment The fact that the sample mean, $\hat{\mu}_n$, is necessarily a consistent estimator
for the true mean $\mu$, no matter what pdf the data come from, is often referred to as
the weak law of large numbers. It was first proved by Chebyshev in 1866.


Comment We saw in Section 5.6 that one of the theoretical reasons that justifies
using the method of maximum likelihood to identify good estimators is the fact
that maximum likelihood estimators are necessarily functions of sufficient statistics.
As an additional rationale for seeking maximum likelihood estimators, it can be
proved under very general conditions that maximum likelihood estimators are also
consistent (see 93).


Questions


5.7.1. How large a sample must be taken from a normal pdf where $E(Y) = 18$ in order to guarantee that
$\hat{\mu}_n = \bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$ has a 90% probability of lying somewhere in the interval [16, 20]? Assume that $\sigma = 5.0$.


5.7.2. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size $n$ from a normal pdf having $\mu = 0$. Show that
$S_n^2 = \frac{1}{n}\sum_{i=1}^{n} Y_i^2$ is a consistent estimator for $\sigma^2 = \operatorname{Var}(Y)$.


5.7.3. Suppose $Y_1, Y_2, \ldots, Y_n$ is a random sample from the exponential pdf, $f_Y(y; \lambda) = \lambda e^{-\lambda y}$, $y > 0$.

(a) Show that $\hat{\lambda}_n = Y_1$ is not consistent for $\lambda$.

(b) Show that $\hat{\lambda}_n = \sum_{i=1}^{n} Y_i$ is not consistent for $\lambda$.


5.7.4. An estimator $\hat{\theta}_n$ is said to be squared-error consistent for $\theta$ if $\lim_{n \to \infty} E[(\hat{\theta}_n - \theta)^2] = 0$.

(a) Show that any squared-error consistent $\hat{\theta}_n$ is asymptotically unbiased (see Question 5.4.15).

(b) Show that any squared-error consistent $\hat{\theta}_n$ is consistent in the sense of Definition 5.7.1.


5.7.5. Suppose $\hat{\theta}_n = Y_{\max}$ is to be used as an estimator for the parameter $\theta$ in the uniform pdf, $f_Y(y; \theta) =
1/\theta$, $0 \le y \le \theta$. Show that $\hat{\theta}_n$ is squared-error consistent (see Question 5.7.4).


5.7.6. If $2n + 1$ random observations are drawn from a continuous and symmetric pdf with mean $\mu$ and if
$f_Y(\mu; \mu) \ne 0$, then the sample median, $Y'_{n+1}$, is unbiased for $\mu$, and $\operatorname{Var}(Y'_{n+1}) = 1/(8[f_Y(\mu; \mu)]^2 n)$ [see (54)]. Show that
$\hat{\mu}_n = Y'_{n+1}$ is consistent for $\mu$.


5.8 Bayesian Estimation 


Bayesian analysis is a set of statistical techniques based on inverse probabilities cal- 
culated from Bayes’ Theorem (recall Section 2.4). In particular, Bayesian statistics 
provide formal methods for incorporating prior knowledge into the estimation of 
unknown parameters. 

An interesting example of a Bayesian solution to an unusual estimation problem 
occurred some years ago in the search for a missing nuclear submarine. In the spring 
of 1968, the USS Scorpion was on maneuvers with the Sixth Fleet in Mediterranean 
waters. In May, she was ordered to proceed to her homeport of Norfolk, Virginia. 
The last message from the Scorpion was received on May 21, and indicated her 
position to be about fifty miles south of the Azores, a group of islands eight hundred 
miles off the coast of Portugal. Navy officials decided that the sub had sunk some- 
where along the eastern coast of the United States. A massive search was mounted, 
but to no avail, and the Scorpion’s fate remained a mystery. 

Enter John Craven, a Navy expert in deep-water exploration, who believed the 
Scorpion had not been found because it had never reached the eastern seaboard and 
was still somewhere near the Azores. In setting up a search strategy, Craven divided
the area near the Azores into a grid of n squares, and solicited the advice of a group
of veteran submarine commanders on the chances of the Scorpion having been lost 
in each of those regions. Combining their opinions resulted in a set of probabilities, 
$P(A_1), P(A_2), \ldots, P(A_n)$, that the sub had sunk in areas $1, 2, \ldots, n$, respectively.

Now, suppose $P(A_k)$ was the largest of the $P(A_i)$'s. Then area $k$ would be the
first region searched. Let $B_k$ be the event that the Scorpion would be found if it had
sunk in area $k$ and area $k$ was searched. Assume that the sub was not found. From
Theorem 2.4.2,

$$P(A_k \mid B_k^C) = \frac{P(B_k^C \mid A_k)\,P(A_k)}{P(B_k^C \mid A_k)\,P(A_k) + P(B_k^C \mid A_k^C)\,P(A_k^C)}$$

becomes an updated $P(A_k)$; call it $P^*(A_k)$. The remaining $P(A_i)$'s, $i \neq k$, can then
be normalized to form the revised probabilities $P^*(A_i)$, $i \neq k$, where $\sum_{i=1}^{n} P^*(A_i) = 1$.
If $P^*(A_j)$ was the largest of the $P^*(A_i)$'s, then area $j$ would be searched next. If the
sub was not found there, a third set of probabilities, $P^{**}(A_1), P^{**}(A_2), \ldots, P^{**}(A_n)$,
would be calculated in the same fashion, and the search would continue.
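The updating step just described can be sketched in a few lines of Python. This is only an illustration, not the procedure Craven's team actually coded: the three prior probabilities and the detection probability of 0.8 are made-up values, and the sketch assumes that $P(B_k^C \mid A_k^C) = 1$ (searching area $k$ cannot find a sub that lies elsewhere).

```python
def update_after_failed_search(priors, k, detect_prob):
    """Return revised location probabilities after area k is searched without success.

    priors      -- list of P(A_i), summing to 1
    k           -- index of the area that was searched
    detect_prob -- P(B_k | A_k), the chance of finding the sub if it is in area k
    """
    miss = 1.0 - detect_prob                      # P(B_k^C | A_k)
    # P(B_k^C): a miss is assumed certain when the sub lies in any other area.
    evidence = miss * priors[k] + (1.0 - priors[k])
    posterior = [p / evidence for p in priors]    # P(B_k^C | A_i) = 1 for i != k
    posterior[k] = miss * priors[k] / evidence    # Bayes' Theorem for the searched area
    return posterior


priors = [0.5, 0.3, 0.2]          # hypothetical expert opinions for three areas
k = priors.index(max(priors))     # search the most likely area first
revised = update_after_failed_search(priors, k, detect_prob=0.8)
print([round(p, 3) for p in revised])
```

Calling the function again on `revised`, with the new most likely area, would produce the next set of probabilities.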

In October of 1968, the USS Scorpion was, indeed, found near the Azores; 
all ninety-nine men aboard had perished. Why it sank has never been disclosed.
One theory has suggested that one of its torpedoes accidentally exploded; Cold 
War conspiracy advocates think it may have been sunk while spying on a group 
of Soviet subs. What is known is that the strategy of using Bayes’ Theorem to 
update the location probabilities of where the Scorpion might have sunk proved 
to be successful. 


Prior Distributions and Posterior Distributions 


Conceptually, a major difference between Bayesian analysis and non-Bayesian 
analysis is the assumptions associated with unknown parameters. In a non-Bayesian 
analysis (which would include all the statistical methodology in this book except 
the present section), unknown parameters are viewed as constants; in a Bayesian 
analysis, parameters are treated as random variables, meaning they have a pdf. 

At the outset in a Bayesian analysis, the pdf assigned to the parameter may 
be based on little or no information and is referred to as the prior distribution. 
As soon as some data are collected, it becomes possible—via Bayes’ Theorem— 
to revise and refine the pdf ascribed to the parameter. Any such updated pdf is 
referred to as a posterior distribution. In the search for the USS Scorpion, the
unknown parameters were the probabilities of finding the sub in each of the grid
areas surrounding the Azores. The prior distribution on those parameters was
the set of probabilities $P(A_1), P(A_2), \ldots, P(A_n)$. Each time an area was searched and
the sub not found, a posterior distribution was calculated: the first was the set
of probabilities $P^*(A_1), P^*(A_2), \ldots, P^*(A_n)$; the second was the set of probabilities
$P^{**}(A_1), P^{**}(A_2), \ldots, P^{**}(A_n)$; and so on.


Example 5.8.1

Suppose a retailer is interested in modeling the number of calls arriving at a phone
bank in a five-minute interval. Section 4.2 established that the Poisson distribution
would be the pdf to choose. But what value should be assigned to the Poisson's
parameter, $\lambda$?

If the rate of calls was constant over a twenty-four-hour period, an estimate $\lambda_e$
for $\lambda$ could be calculated by dividing the total number of calls received during a full
day by 288, the latter being the number of five-minute intervals in a twenty-four-hour
period. If the random variable $X$, then, denotes the number of calls received
during a random five-minute interval, the estimated probability that $X = k$ would be

$$p_X(k) = \frac{e^{-\lambda_e} \lambda_e^k}{k!}, \quad k = 0, 1, 2, \ldots$$

In reality, though, the incoming call rate is not likely to remain constant over an
entire twenty-four-hour period. Suppose, in fact, that an examination of telephone
logs for the past several months suggests that $\lambda$ equals 10 about three-quarters of
the time, and it equals 8 about one-quarter of the time. Described in Bayesian
terminology, the rate parameter is a random variable $\Lambda$, and the (discrete) prior
distribution for $\Lambda$ is defined by two probabilities:

$$p_\Lambda(8) = P(\Lambda = 8) = 0.25$$

and

$$p_\Lambda(10) = P(\Lambda = 10) = 0.75$$

Now, suppose certain facets of the retailer’s operation have recently changed 
(different products to sell, different amounts of advertising, etc.). Those changes 
may very well affect the distribution associated with the call rate. Updating the prior
distribution for $\Lambda$ requires (a) some data and (b) an application of Bayes' Theorem.
Being both frugal and statistically challenged, the retailer decides to construct a
posterior distribution for $\Lambda$ on the basis of a single observation. To that end, a
five-minute interval is preselected at random and the corresponding value for $X$ is found
to be 7. How should $p_\Lambda(8)$ and $p_\Lambda(10)$ be revised?

Using Bayes' Theorem,

$$P(\Lambda = 10 \mid X = 7) = \frac{P(X = 7 \mid \Lambda = 10)\,P(\Lambda = 10)}{P(X = 7 \mid \Lambda = 8)\,P(\Lambda = 8) + P(X = 7 \mid \Lambda = 10)\,P(\Lambda = 10)}$$

$$= \frac{\left(e^{-10}\frac{10^7}{7!}\right)(0.75)}{\left(e^{-8}\frac{8^7}{7!}\right)(0.25) + \left(e^{-10}\frac{10^7}{7!}\right)(0.75)}$$

$$= \frac{(0.090)(0.75)}{(0.140)(0.25) + (0.090)(0.75)} = 0.659$$

which implies that

$$P(\Lambda = 8 \mid X = 7) = 1 - 0.659 = 0.341$$

Notice that the posterior distribution for $\Lambda$ has changed in a way that makes sense
intuitively. Initially, $P(\Lambda = 8)$ was 0.25. Since the data point, $x = 7$, is more consistent
with $\Lambda = 8$ than with $\Lambda = 10$, the posterior pdf has increased the probability that
$\Lambda = 8$ (from 0.25 to 0.341) and decreased the probability that $\Lambda = 10$ (from 0.75 to
0.659).
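The arithmetic in this example is easy to reproduce. The following Python sketch (the function names are illustrative, not from the text) applies Bayes' Theorem to a discrete prior stored as a dictionary, using the prior probabilities 0.25 and 0.75 and the single observation $x = 7$ given above.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def discrete_posterior(prior, x):
    """prior maps each candidate rate to its prior probability."""
    joint = {lam: poisson_pmf(x, lam) * p for lam, p in prior.items()}
    total = sum(joint.values())               # denominator of Bayes' Theorem
    return {lam: j / total for lam, j in joint.items()}


prior = {8: 0.25, 10: 0.75}
posterior = discrete_posterior(prior, x=7)
for lam, prob in sorted(posterior.items()):
    print(lam, round(prob, 3))
```

Because the prior is just a dictionary, the same function handles any finite set of candidate rates.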


Definition 5.8.1. Let $W$ be a statistic dependent on a parameter $\theta$. Call its pdf
$f_W(w \mid \theta)$. Assume that $\theta$ is the value of a random variable $\Theta$, whose prior
distribution is denoted $p_\Theta(\theta)$, if $\Theta$ is discrete, and $f_\Theta(\theta)$, if $\Theta$ is continuous. The
posterior distribution of $\Theta$, given that $W = w$, is the quotient

$$g_\Theta(\theta \mid W = w) = \begin{cases} \dfrac{p_W(w \mid \theta)\, f_\Theta(\theta)}{\int_{-\infty}^{\infty} p_W(w \mid \theta)\, f_\Theta(\theta)\, d\theta} & \text{if } W \text{ is discrete} \\[2ex] \dfrac{f_W(w \mid \theta)\, f_\Theta(\theta)}{\int_{-\infty}^{\infty} f_W(w \mid \theta)\, f_\Theta(\theta)\, d\theta} & \text{if } W \text{ is continuous} \end{cases}$$

[Note: If $\Theta$ is discrete, call its pdf $p_\Theta(\theta)$ and replace the integrations with
summations.]

Comment. Definition 5.8.1 can be used to construct a posterior distribution even
if no information is available on which to base a prior distribution. In such cases,
the uniform pdf is substituted for either $p_\Theta(\theta)$ or $f_\Theta(\theta)$ and referred to as a
noninformative prior.


Example 5.8.2

Max, a video game pirate (and Bayesian), is trying to decide how many illegal copies
of Zombie Beach Party to have on hand for the upcoming holiday season. To get a
rough idea of what the demand might be, he talks with $n$ potential customers and
finds that $X = k$ would buy a copy for a present (or for themselves). The obvious
choice for a probability model for $X$, of course, would be the binomial pdf. Given $n$
potential customers, the probability that $k$ would actually buy one of Max's illegal
copies is the familiar

$$p_X(k \mid \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n-k}, \quad k = 0, 1, \ldots, n$$

where the maximum likelihood estimate for $\theta$ is given by $\theta_e = k/n$.

It may very well be the case, though, that Max has some additional insight about
the value of $\theta$ on the basis of similar video games that he illegally marketed in
previous years. Suppose he suspects, for example, that the percentage of potential
customers who will buy Zombie Beach Party is likely to be between 3% and 4% and
probably will not exceed 7%. A reasonable prior distribution for $\Theta$, then, would be
a pdf mostly concentrated over the interval 0 to 0.07 with a mean or median in the
0.035 range.

One such probability model whose shape would comply with the restraints that
Max is imposing is the beta pdf. Written with $\Theta$ as the random variable, the
(two-parameter) beta pdf is given by

$$f_\Theta(\theta) = \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\, \theta^{r-1} (1 - \theta)^{s-1}, \quad 0 \le \theta \le 1$$

The beta distribution with $r = 2$ and $s = 4$ is pictured in Figure 5.8.1. By choosing
different values for $r$ and $s$, $f_\Theta(\theta)$ can be skewed more sharply to the right or
to the left, and the bulk of the distribution can be concentrated close to zero or
close to one. The question is, if an appropriate beta pdf is used as a prior
distribution for $\Theta$, and if a random sample of $k$ potential customers (out of $n$) said
they would buy the video game, what would be a reasonable posterior distribution
for $\Theta$?


Figure 5.8.1 (density $f_\Theta(\theta)$ of the beta pdf with $r = 2$ and $s = 4$, plotted against $\theta$ on [0, 1])


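From Definition 5.8.1 for the case where $W$ ($= X$) is discrete and $\Theta$ is continuous,

$$g_\Theta(\theta \mid X = k) = \frac{p_X(k \mid \theta)\, f_\Theta(\theta)}{\int_0^1 p_X(k \mid \theta)\, f_\Theta(\theta)\, d\theta}$$

Substituting into the numerator gives

$$p_X(k \mid \theta)\, f_\Theta(\theta) = \binom{n}{k} \theta^k (1 - \theta)^{n-k} \cdot \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\, \theta^{r-1} (1 - \theta)^{s-1}$$

$$= \binom{n}{k} \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\, \theta^{k+r-1} (1 - \theta)^{n-k+s-1}$$

so

$$g_\Theta(\theta \mid X = k) = \frac{\binom{n}{k} \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\, \theta^{k+r-1} (1 - \theta)^{n-k+s-1}}{\int_0^1 \binom{n}{k} \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\, \theta^{k+r-1} (1 - \theta)^{n-k+s-1}\, d\theta}$$

$$= \left[ \frac{\binom{n}{k} \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}}{\int_0^1 \binom{n}{k} \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\, \theta^{k+r-1} (1 - \theta)^{n-k+s-1}\, d\theta} \right] \theta^{k+r-1} (1 - \theta)^{n-k+s-1}$$

Notice that if the parameters $r$ and $s$ in the beta pdf were relabeled $k + r$ and
$n - k + s$, respectively, the equation for $f_\Theta(\theta)$ would be

$$f_\Theta(\theta) = \frac{\Gamma(n + r + s)}{\Gamma(k + r)\Gamma(n - k + s)}\, \theta^{k+r-1} (1 - \theta)^{n-k+s-1}$$

But those same exponents for $\theta$ and $(1 - \theta)$ appear outside the brackets in the
expression for $g_\Theta(\theta \mid X = k)$. Since there can be only one $f_\Theta(\theta)$ whose variable factors
are $\theta^{k+r-1}(1 - \theta)^{n-k+s-1}$, it follows that $g_\Theta(\theta \mid X = k)$ is a beta pdf with parameters
$k + r$ and $n - k + s$.

The final step in the construction of a posterior distribution for $\Theta$ is to choose
values for $r$ and $s$ that would produce a (prior) beta distribution having the
configuration described on p. 336, that is, with a mean or median at 0.035 and the bulk
of the distribution between 0 and 0.07. It can be shown [see (92)] that the expected
value of a beta pdf is $r/(r + s)$. Setting 0.035, then, equal to that quotient implies
that, approximately,

$$s = 28r$$

By trial and error with a calculator that can integrate a beta pdf, the values $r = 4$ and
$s = 102$ (slightly below $28(4) = 112$, which gives a prior mean of $4/106 \approx 0.038$) are
found to yield an $f_\Theta(\theta)$ having almost all of its area to the left
of 0.07. Substituting those values for $r$ and $s$ into $g_\Theta(\theta \mid X = k)$ gives the completed
posterior distribution:

$$g_\Theta(\theta \mid X = k) = \frac{\Gamma(n + 106)}{\Gamma(k + 4)\Gamma(n - k + 102)}\, \theta^{k+4-1} (1 - \theta)^{n-k+102-1}$$

$$= \frac{(n + 105)!}{(k + 3)!\,(n - k + 101)!}\, \theta^{k+3} (1 - \theta)^{n-k+101}$$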
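The "trial and error" search for $r$ and $s$, and the conjugate update itself, can be illustrated with a short Python sketch. It is not the calculator routine the text has in mind: the area under the prior is approximated here with Simpson's rule, the helper names are made up, and the survey result ($k = 3$ buyers out of $n = 50$) is a hypothetical value chosen only to show the update.

```python
import math


def beta_pdf(theta, r, s):
    """Beta density; returns 0 at the endpoints (valid for r, s > 1)."""
    if theta <= 0.0 or theta >= 1.0:
        return 0.0
    log_c = math.lgamma(r + s) - math.lgamma(r) - math.lgamma(s)
    return math.exp(log_c + (r - 1) * math.log(theta) + (s - 1) * math.log(1 - theta))


def beta_area(x, r, s, steps=2000):
    """Simpson's rule approximation to the area under the beta pdf on [0, x]."""
    h = x / steps
    total = beta_pdf(0.0, r, s) + beta_pdf(x, r, s)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, r, s)
    return total * h / 3


def posterior_params(r, s, n, k):
    """A beta prior (r, s) combined with k buyers out of n gives a beta posterior."""
    return k + r, n - k + s


r, s = 4, 102                                   # prior parameters used in the text
area = beta_area(0.07, r, s)                    # prior probability that theta <= 0.07
post_r, post_s = posterior_params(r, s, n=50, k=3)   # n and k are hypothetical
print(round(area, 3), post_r, post_s, round(post_r / (post_r + post_s), 4))
```

Trying other pairs $(r, s)$ in `beta_area` is one way to carry out the search described above.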


Example 5.8.3

Certain prior distributions "fit" especially well with certain parameters in the sense
that the resulting posterior distributions are easy to work with. Example 5.8.2 was
a case in point: assigning a beta prior distribution to the unknown parameter in a
binomial pdf led to a beta posterior distribution. A similar relationship holds if a
gamma pdf is used as the prior distribution for the parameter in a Poisson model.

Suppose $X_1, X_2, \ldots, X_n$ denotes a random sample from the Poisson pdf,
$p_X(k \mid \theta) = e^{-\theta}\theta^k/k!$, $k = 0, 1, \ldots$. Let $W = \sum_{i=1}^{n} X_i$. By Example 3.12.10, $W$ has a Poisson
distribution with parameter $n\theta$, that is, $p_W(w \mid \theta) = e^{-n\theta}(n\theta)^w/w!$, $w = 0, 1, 2, \ldots$.
Let the gamma pdf,

$$f_\Theta(\theta) = \frac{\mu^s}{\Gamma(s)}\, \theta^{s-1} e^{-\mu\theta}, \quad 0 < \theta < \infty$$

be the prior distribution assigned to $\Theta$. Then

$$g_\Theta(\theta \mid W = w) = \frac{p_W(w \mid \theta)\, f_\Theta(\theta)}{\int_\theta p_W(w \mid \theta)\, f_\Theta(\theta)\, d\theta}$$

where

$$p_W(w \mid \theta)\, f_\Theta(\theta) = e^{-n\theta}\frac{(n\theta)^w}{w!} \cdot \frac{\mu^s}{\Gamma(s)}\, \theta^{s-1} e^{-\mu\theta} = \frac{n^w}{w!} \frac{\mu^s}{\Gamma(s)}\, \theta^{w+s-1} e^{-(\mu+n)\theta}$$

Now, using the same argument that simplified the calculation of the posterior
distribution in Example 5.8.2, we can write

$$g_\Theta(\theta \mid W = w) = \left[ \frac{\frac{n^w}{w!} \frac{\mu^s}{\Gamma(s)}}{\int_\theta p_W(w \mid \theta)\, f_\Theta(\theta)\, d\theta} \right] \theta^{w+s-1} e^{-(\mu+n)\theta}$$

But the only pdf having the factors $\theta^{w+s-1} e^{-(\mu+n)\theta}$ is the gamma distribution with
parameters $w + s$ and $\mu + n$. It follows, then, that

$$g_\Theta(\theta \mid W = w) = \frac{(\mu + n)^{w+s}}{\Gamma(w + s)}\, \theta^{w+s-1} e^{-(\mu+n)\theta}$$
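The claim that the posterior is a gamma pdf with parameters $w + s$ and $\mu + n$ can be checked numerically. The Python sketch below normalizes the product $p_W(w \mid \theta) f_\Theta(\theta)$ on a fine grid and compares it with the gamma density; the values $s = 3$, $\mu = 2$, $n = 5$, and $w = 9$ are arbitrary choices made for illustration, and the midpoint rule on $(0, 20)$ stands in for the exact integral.

```python
import math


def gamma_pdf(theta, shape, rate):
    """Gamma density written as in the text: rate^shape / Gamma(shape) * ..."""
    return math.exp(shape * math.log(rate) - math.lgamma(shape)
                    + (shape - 1) * math.log(theta) - rate * theta)


def poisson_sum_pmf(w, n, theta):
    """pmf of W = X_1 + ... + X_n, which is Poisson with parameter n * theta."""
    return math.exp(-n * theta + w * math.log(n * theta) - math.lgamma(w + 1))


s, mu = 3.0, 2.0      # hypothetical prior parameters
n, w = 5, 9           # hypothetical sample size and observed total

# Normalize prior x likelihood on a fine grid (midpoint rule on (0, 20)).
h = 0.001
grid = [(i + 0.5) * h for i in range(20000)]
joint = [poisson_sum_pmf(w, n, t) * gamma_pdf(t, s, mu) for t in grid]
evidence = sum(joint) * h

# Largest pointwise gap between the normalized product and the gamma(w+s, mu+n) pdf.
max_gap = max(abs(j / evidence - gamma_pdf(t, w + s, mu + n))
              for t, j in zip(grid, joint))
print(max_gap)
```

A gap near zero is what the conjugacy argument predicts; any remaining difference comes from the numerical integration.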


Case Study 5.8.1 


Predicting the annual number of hurricanes that will hit the U.S. mainland is a 
problem receiving a great deal of public attention, given the disastrous sum- 
mer of 2004, when four major hurricanes struck Florida causing billions of 
dollars of damage and several mass evacuations. For all the reasons discussed 
in Section 4.2, the obvious pdf for modeling the number of hurricanes reaching
the mainland is the Poisson, where the unknown parameter $\theta$ would be the
expected number in a given year.

Table 5.8.1 shows the numbers of hurricanes that actually did come ashore
for three fifty-year periods. Use that information to construct a posterior
distribution for $\theta$. Assume that the prior distribution is a gamma pdf.

Table 5.8.1

Years          Number of Hurricanes
1851-1900      88
1901-1950      92
1951-2000      72

Not surprisingly, meteorologists consider the data from the earliest period,
1851 to 1900, to be the least reliable. Those eighty-eight hurricanes, then, will be
used to formulate the prior distribution. Let

$$f_\Theta(\theta) = \frac{\mu^s}{\Gamma(s)}\, \theta^{s-1} e^{-\mu\theta}, \quad 0 < \theta < \infty$$

Recall from Theorem 4.6.3 that for a gamma pdf, $E(\Theta) = s/\mu$. For the years
from 1851 to 1900, though, the sample average number of hurricanes per year
was $88/50$. Setting the latter equal to $E(\Theta)$ allows $s = 88$ and $\mu = 50$ to be assigned
to the gamma's parameters. That is, we can take the prior distribution to be

$$f_\Theta(\theta) = \frac{50^{88}}{\Gamma(88)}\, \theta^{88-1} e^{-50\theta}$$

Also, the posterior distribution given at the end of Example 5.8.3 becomes

$$g_\Theta(\theta \mid W = w) = \frac{(50 + n)^{w+88}}{\Gamma(w + 88)}\, \theta^{w+87} e^{-(50+n)\theta}$$

The data, then, to incorporate into the posterior distribution would be the
fact that $w = 92 + 72 = 164$ hurricanes occurred over the most recent $n = 100$
years included in the database. Therefore,

$$g_\Theta(\theta \mid W = w) = \frac{(50 + 100)^{164+88}}{\Gamma(164 + 88)}\, \theta^{164+87} e^{-(50+100)\theta} = \frac{(150)^{252}}{\Gamma(252)}\, \theta^{251} e^{-150\theta}$$

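Example 5.8.4

In the examples seen thus far, the joint pdf $g_{W,\Theta}(w, \theta) = p_W(w \mid \theta) f_\Theta(\theta)$ of a statistic
$W$ and a parameter $\Theta$ [with a prior distribution $f_\Theta(\theta)$] was the starting point in finding
the posterior distribution of $\Theta$. For some applications, though, the objective is
not to derive $g_\Theta(\theta \mid W = w)$, but, rather, to find the marginal pdf of $W$.

For instance, suppose a sample of size $n = 1$ is drawn from a Poisson pdf,
$p_W(w \mid \theta) = e^{-\theta}\theta^w/w!$, $w = 0, 1, \ldots$, where the prior distribution is the gamma pdf,
$f_\Theta(\theta) = \frac{\mu^s}{\Gamma(s)}\theta^{s-1}e^{-\mu\theta}$. According to Example 5.8.3,

$$g_{W,\Theta}(w, \theta) = p_W(w \mid \theta)\, f_\Theta(\theta) = \frac{1}{w!} \frac{\mu^s}{\Gamma(s)}\, \theta^{w+s-1} e^{-(\mu+1)\theta}$$

What is the corresponding marginal pdf of $W$, that is, $p_W(w)$?
Recall Theorem 3.7.2. Integrating the joint pdf of $W$ and $\Theta$ over $\theta$ gives

$$p_W(w) = \int_0^\infty g_{W,\Theta}(w, \theta)\, d\theta$$

$$= \int_0^\infty \frac{1}{w!} \frac{\mu^s}{\Gamma(s)}\, \theta^{w+s-1} e^{-(\mu+1)\theta}\, d\theta$$

$$= \frac{1}{w!} \frac{\mu^s}{\Gamma(s)} \int_0^\infty \theta^{w+s-1} e^{-(\mu+1)\theta}\, d\theta$$

$$= \frac{1}{w!} \frac{\mu^s}{\Gamma(s)} \cdot \frac{\Gamma(w + s)}{(\mu + 1)^{w+s}}$$

$$= \frac{\Gamma(w + s)}{w!\,\Gamma(s)} \left( \frac{\mu}{\mu + 1} \right)^s \left( \frac{1}{\mu + 1} \right)^w$$

But $\frac{\Gamma(w + s)}{w!\,\Gamma(s)} = \binom{w + s - 1}{w}$. Finally, let $p = \mu/(\mu + 1)$, so $1 - p = 1/(\mu + 1)$, and the
marginal pdf reduces to a negative binomial distribution with parameters $s$ and $p$:

$$p_W(w) = \binom{w + s - 1}{w} p^s (1 - p)^w$$

(see Question 4.5.6).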
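As a sanity check on this derivation, the marginal pdf can be computed two ways: by integrating the joint pdf numerically and by evaluating the negative binomial formula. The Python sketch below does both. The gamma parameters $s = 2.5$ and $\mu = 1.5$ are arbitrary illustrative values, and the midpoint rule on $(0, 60)$ is an assumed stand-in for the exact integral over $(0, \infty)$.

```python
import math


def joint_pdf(w, theta, s, mu):
    """Poisson(theta) pmf at w times the gamma(s, mu) prior density at theta."""
    log_val = (-theta + w * math.log(theta) - math.lgamma(w + 1)
               + s * math.log(mu) - math.lgamma(s)
               + (s - 1) * math.log(theta) - mu * theta)
    return math.exp(log_val)


def marginal_by_integration(w, s, mu, upper=60.0, steps=60000):
    """Midpoint-rule approximation to the integral of the joint pdf over theta."""
    h = upper / steps
    return sum(joint_pdf(w, (i + 0.5) * h, s, mu) for i in range(steps)) * h


def negative_binomial_pmf(w, s, p):
    """Gamma(w+s) / (w! Gamma(s)) * p^s * (1-p)^w, allowing non-integer s."""
    log_coef = math.lgamma(w + s) - math.lgamma(w + 1) - math.lgamma(s)
    return math.exp(log_coef + s * math.log(p) + w * math.log(1 - p))


s, mu = 2.5, 1.5                  # hypothetical gamma parameters
p = mu / (mu + 1)
for w in range(5):
    print(w, marginal_by_integration(w, s, mu), negative_binomial_pmf(w, s, p))
```

The two columns should agree to several decimal places for each $w$.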


Case Study 5.8.2 


Psychologists use a special coordination test to study a person’s likelihood of 
making manual errors. For any given person, the number of such errors made 
on the test is known to follow a Poisson distribution with some particular value 
for the rate parameter, $\theta$. But as we all know (from watching the clumsy people
around us who spill things and get in our way), $\theta$ varies considerably from
person to person. Suppose, in fact, that variability in $\Theta$ can be described by a
gamma pdf. If so, the marginal pdf of the number of errors made by an individual
should have a negative binomial distribution (according to Example 5.8.4).

Columns 1 and 2 of Table 5.8.2 show the number of errors made on the
coordination test by a sample of 504 subjects: 82 made zero errors, 57 made one
error, and so on. To know whether those responses can be adequately modeled
by a negative binomial distribution requires that the parameters $p$ and $s$ be
estimated. To that end, it should be noted that the maximum likelihood estimate
for $p$ in a negative binomial is $ns / \sum_{i=1}^{n} w_i$. Expected frequencies, then, can be
calculated by choosing a value for $s$ and solving for $p$. By trial and error, the
entries shown in Column 3 were based on a negative binomial pdf for which
$s = 0.8$ and $p = (504)(0.8)/3821 = 0.106$. Clearly, the model fits exceptionally
well, which supports the analysis carried out in Example 5.8.4.

Table 5.8.2

Number of      Observed      Negative Binomial
Errors, w      Frequency     Predicted Frequency
0              82            79.2
1              57            57.1
2              46            46.3
3              39            38.9
4              33            33.3
5              28            28.8
6              25            25.1
7              22            22.0
8              19            19.3
9              17            17.0
10             15            15.0
11             13            13.3
12             12            11.8
13             10            10.4
14             9             9.3
15             8             8.3
16             7             7.3
17             6             6.5
18             6             5.8
19             5             5.2
20             5             4.6
21             4             4.1
22             4             3.7
23             3             3.3
24             3             2.9
25             3             2.6
26             2             2.4
27             2             2.1
28             2             1.9
29             2             1.7
30             2             1.5
≥31            13            13.1
Total          504           504.0


Bayesian Estimation 


Fundamental to the philosophy of Bayesian analysis is the notion that all relevant
information about an unknown parameter, $\theta$, is encoded in the parameter's posterior
distribution, $g_\Theta(\theta \mid W = w)$. Given that premise, an obvious question arises:
How can $g_\Theta(\theta \mid W = w)$ be used to calculate an appropriate point estimator, $\hat{\theta}$? One
approach, similar to using the likelihood function to find a maximum likelihood
estimator, is to differentiate the posterior distribution, in which case the value for which
$dg_\Theta(\theta \mid W = w)/d\theta = 0$ (that is, the mode) becomes $\hat{\theta}$.

For theoretical reasons, though, a method much preferred by Bayesians is to use
some key ideas from decision theory as a framework for identifying a reasonable $\hat{\theta}$.
In particular, Bayesian estimates are chosen to minimize the risk associated with $\hat{\theta}$,
where the risk is the expected value of the loss incurred by the error in the estimate.
Presumably, as $\hat{\theta} - \theta$ gets further away from 0 (that is, as the estimation error gets
larger) the loss associated with $\hat{\theta}$ will increase.


Definition 5.8.2. Let $\hat{\theta}$ be an estimator for $\theta$ based on a statistic $W$. The loss
function associated with $\hat{\theta}$ is denoted $L(\hat{\theta}, \theta)$, where $L(\hat{\theta}, \theta) \ge 0$ and $L(\theta, \theta) = 0$.

It is typically the case that quantifying in any precise way the consequences,
economic or otherwise, of $\hat{\theta}$ not being equal to $\theta$ is all but impossible. The "generic"
loss functions defined in those situations are chosen primarily for their mathematical
convenience. Two of the most frequently used are $L(\hat{\theta}, \theta) = |\hat{\theta} - \theta|$ and
$L(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$. Sometimes, though, the context in which a parameter is being estimated
does allow for a loss function to be defined in a very specific and relevant way.

Example 5.8.5

Consider the inventory dilemma faced by Max, the Bayesian video game pirate
whose illegal activities were described in Example 5.8.2. The unknown parameter
in question was $\theta$, the proportion of his $n$ potential customers who would purchase
a copy of Zombie Beach Party. Suppose that Max decides (for whatever reasons)
to estimate $\theta$ with $\hat{\theta}$. As a consequence, it would follow that he should have $n\hat{\theta}$
copies of the video game available. That said, what would be the corresponding loss
function?

Here, the implications of $\hat{\theta}$ not being equal to $\theta$ are readily quantifiable. If $\hat{\theta} < \theta$,
then $n(\theta - \hat{\theta})$ sales will be lost (at a cost of, say, \$c per video). On the other hand,
if $\hat{\theta} > \theta$, there will be $n(\hat{\theta} - \theta)$ unsold videos, each of which will incur a storage cost
of, say, \$d per unit. The loss function that applies to Max's situation, then, is clearly
defined:

$$L(\hat{\theta}, \theta) = \begin{cases} \$cn(\theta - \hat{\theta}) & \text{if } \hat{\theta} < \theta \\ \$dn(\hat{\theta} - \theta) & \text{if } \hat{\theta} > \theta \end{cases}$$

Definition 5.8.3. Let $L(\hat{\theta}, \theta)$ be the loss function associated with an estimate of
the parameter $\theta$. Let $g_\Theta(\theta \mid W = w)$ be the posterior distribution of the random
variable $\Theta$. Then the risk associated with $\hat{\theta}$ is the expected value of the loss
function with respect to the posterior distribution of $\theta$.

$$\text{risk} = \begin{cases} \int_\theta L(\hat{\theta}, \theta)\, g_\Theta(\theta \mid W = w)\, d\theta & \text{if } \Theta \text{ is continuous} \\[1ex] \sum_{\text{all } \theta} L(\hat{\theta}, \theta)\, g_\Theta(\theta \mid W = w) & \text{if } \Theta \text{ is discrete} \end{cases}$$

Using the Risk Function to Find $\hat{\theta}$

Given that the risk function represents the expected loss associated with the
estimator $\hat{\theta}$, it makes sense to look for the $\hat{\theta}$ that minimizes the risk. Any $\hat{\theta}$ that achieves
that objective is said to be a Bayes estimate. In general, finding the Bayes estimate
requires solving the equation $d(\text{risk})/d\hat{\theta} = 0$. For two of the most frequently used
loss functions, $L(\hat{\theta}, \theta) = |\hat{\theta} - \theta|$ and $L(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$, though, there is a much easier
way to calculate $\hat{\theta}$.

Theorem 5.8.1

Let $g_\Theta(\theta \mid W = w)$ be the posterior distribution for the unknown parameter $\theta$.

a. If the loss function associated with $\hat{\theta}$ is $L(\hat{\theta}, \theta) = |\hat{\theta} - \theta|$, then the Bayes estimate
for $\theta$ is the median of $g_\Theta(\theta \mid W = w)$.

b. If the loss function associated with $\hat{\theta}$ is $L(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$, then the Bayes estimate
for $\theta$ is the mean of $g_\Theta(\theta \mid W = w)$.

Proof


a. The proof follows from a general result for the expected value of a random
variable. The fact that the pdf in the expectation here is a posterior distribution
is irrelevant. The derivation will be given for a continuous random variable
(having a finite expected value); the proof for the discrete case is similar.

Let $f_W(w)$ be the pdf for the random variable $W$, where the median of $W$ is
$m$. Then

$$E(|W - m|) = \int_{-\infty}^{\infty} |w - m|\, f_W(w)\, dw$$

$$= \int_{-\infty}^{m} (m - w)\, f_W(w)\, dw + \int_{m}^{\infty} (w - m)\, f_W(w)\, dw$$

$$= m \int_{-\infty}^{m} f_W(w)\, dw - \int_{-\infty}^{m} w f_W(w)\, dw + \int_{m}^{\infty} w f_W(w)\, dw - m \int_{m}^{\infty} f_W(w)\, dw$$

The first and last integrals are equal by definition of the median, so

$$E(|W - m|) = -\int_{-\infty}^{m} w f_W(w)\, dw + \int_{m}^{\infty} w f_W(w)\, dw$$

Now, suppose $m \ge 0$ (the proof for negative $m$ is similar). Splitting the first
integral into two parts gives

$$E(|W - m|) = -\int_{-\infty}^{0} w f_W(w)\, dw - \int_{0}^{m} w f_W(w)\, dw + \int_{m}^{\infty} w f_W(w)\, dw$$

Notice that the middle integral is nonnegative, so changing its negative sign to a plus
implies that

$$E(|W - m|) \le -\int_{-\infty}^{0} w f_W(w)\, dw + \int_{0}^{m} w f_W(w)\, dw + \int_{m}^{\infty} w f_W(w)\, dw$$

$$= \int_{-\infty}^{0} (-w) f_W(w)\, dw + \int_{0}^{\infty} w f_W(w)\, dw$$

Therefore,

$$E(|W - m|) \le E(|W|) \qquad (5.8.1)$$

Finally, suppose $b$ is any constant. Then

$$\frac{1}{2} = P(W \le m) = P(W - b \le m - b)$$

showing that $m - b$ is the median of the random variable $W - b$. Applying
Equation 5.8.1 to the variable $W - b$, we can write

$$E(|W - m|) = E[|(W - b) - (m - b)|] \le E(|W - b|)$$

which implies that the median of $g_\Theta(\theta \mid W = w)$ is the Bayes estimate for $\theta$ when
$L(\hat{\theta}, \theta) = |\hat{\theta} - \theta|$.

b. Let $W$ be any random variable whose mean is $\mu$ and whose variance is finite,
and let $b$ be any constant. Then

$$E[(W - b)^2] = E\{[(W - \mu) + (\mu - b)]^2\}$$

$$= E[(W - \mu)^2] + 2(\mu - b)E(W - \mu) + (\mu - b)^2$$

$$= \mathrm{Var}(W) + 0 + (\mu - b)^2$$

implying that $E[(W - b)^2]$ is minimized when $b = \mu$. It follows that the Bayes
estimate for $\theta$, given a quadratic loss function, is the mean of the posterior
distribution.
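Both parts of the theorem can be illustrated by brute force. The Python sketch below uses a small, made-up discrete posterior (five values of $\theta$ with hypothetical probabilities), computes the risk of every candidate estimate on a grid, and reports the minimizers. It is an illustration under those assumptions rather than a substitute for the proof.

```python
# Hypothetical discrete posterior: values of theta and their probabilities.
thetas = [0.1, 0.2, 0.3, 0.4, 0.5]
probs = [0.10, 0.15, 0.40, 0.25, 0.10]


def risk(estimate, loss):
    """Expected loss of `estimate` with respect to the posterior."""
    return sum(loss(estimate, t) * p for t, p in zip(thetas, probs))


def absolute_loss(est, theta):
    return abs(est - theta)


def squared_loss(est, theta):
    return (est - theta) ** 2


# Candidate estimates 0.000, 0.001, ..., 1.000.
candidates = [i / 1000 for i in range(1001)]
best_abs = min(candidates, key=lambda e: risk(e, absolute_loss))
best_sq = min(candidates, key=lambda e: risk(e, squared_loss))

posterior_mean = sum(t * p for t, p in zip(thetas, probs))
print(best_abs, best_sq, posterior_mean)
```

For this posterior the cumulative probabilities are 0.10, 0.25, 0.65, and so on, so the median is 0.3, which is where the absolute-error risk bottoms out; the squared-error risk bottoms out at the mean.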


Example 5.8.6

Recall Example 5.8.3, where the parameter in a Poisson distribution was assumed to
have a gamma prior distribution. For a random sample of size $n$, where $W = \sum_{i=1}^{n} X_i$,

$$p_W(w \mid \theta) = e^{-n\theta}(n\theta)^w / w!, \quad w = 0, 1, 2, \ldots$$

$$f_\Theta(\theta) = \frac{\mu^s}{\Gamma(s)}\, \theta^{s-1} e^{-\mu\theta}$$

which resulted in the posterior distribution being a gamma pdf with parameters
$w + s$ and $\mu + n$.

Suppose the loss function associated with $\hat{\theta}$ is quadratic, $L(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$. By
part (b) of Theorem 5.8.1, the Bayes estimate for $\theta$ is the mean of the posterior
distribution. From Theorem 4.6.3, though, the mean of $g_\Theta(\theta \mid W = w)$ is
$(w + s)/(\mu + n)$.

Notice that

$$\frac{w + s}{\mu + n} = \frac{n}{\mu + n}\left(\frac{w}{n}\right) + \frac{\mu}{\mu + n}\left(\frac{s}{\mu}\right)$$

which shows that the Bayes estimate is a weighted average of $\frac{w}{n}$, the maximum
likelihood estimate for $\theta$, and $\frac{s}{\mu}$, the mean of the prior distribution. Moreover, as $n$ gets
large, the Bayes estimate converges to the maximum likelihood estimate.
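To make the weighted-average interpretation concrete, the short Python sketch below plugs in the hurricane figures from Case Study 5.8.1 ($s = 88$, $\mu = 50$, $w = 164$, $n = 100$); the variable names are illustrative only.

```python
s, mu = 88, 50        # prior gamma parameters from Case Study 5.8.1
w, n = 164, 100       # hurricanes observed over the most recent 100 years

bayes = (w + s) / (mu + n)            # mean of the gamma(w + s, mu + n) posterior
mle = w / n                           # maximum likelihood estimate
prior_mean = s / mu                   # mean of the gamma(s, mu) prior

weight = n / (mu + n)                 # weight placed on the data
weighted = weight * mle + (1 - weight) * prior_mean

print(bayes, mle, prior_mean, weighted)
```

With these numbers the data receive weight $100/150$ and the prior receives weight $50/150$, so the Bayes estimate falls between the maximum likelihood estimate and the prior mean, closer to the former.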


Questions 


5.8.1. Suppose that $X$ is a geometric random variable,
where $p_X(k \mid \theta) = (1 - \theta)^{k-1}\theta$, $k = 1, 2, \ldots$. Assume that the
prior distribution for $\theta$ is the beta pdf with parameters $r$
and $s$. Find the posterior distribution for $\theta$.

5.8.2. Find the squared-error loss [$L(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$]
Bayes estimate for $\theta$ in Example 5.8.2 and express it as
a weighted average of the maximum likelihood estimate
for $\theta$ and the mean of the prior pdf.

5.8.3. Suppose the binomial pdf described in Example
5.8.2 refers to the number of votes a candidate might
receive in a poll conducted before the general election.
Moreover, suppose a beta prior distribution has been
assigned to $\theta$, and every indicator suggests the election
will be close. The pollster, then, has good reason for
concentrating the bulk of the prior distribution around the
value $\theta = \frac{1}{2}$. Setting the two beta parameters $r$ and $s$ both
equal to 135 will accomplish that objective (in the event
$r = s = 135$, the probability of $\theta$ being between 0.45 and
0.55 is approximately 0.90).

(a) Find the corresponding posterior distribution.

(b) Find the squared-error loss Bayes estimate for $\theta$
and express it as a weighted average of the maximum
likelihood estimate for $\theta$ and the mean of the
prior pdf.

5.8.4. What is the squared-error loss Bayes estimate for
the parameter $\theta$ in a binomial pdf, where $\theta$ has a uniform
distribution, that is, a noninformative prior? (Recall that
a uniform prior is a beta pdf for which $r = s = 1$.)

5.8.5. In Questions 5.8.2-5.8.4, is the Bayes estimate
unbiased? Is it asymptotically unbiased?

5.8.6. Suppose that $Y$ is a gamma random variable with
parameters $r$ and $\theta$ and the prior is also gamma with
parameters $s$ and $\mu$. Show that the posterior pdf is gamma
with parameters $r + s$ and $y + \mu$.

5.8.7. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a
gamma pdf with parameters $r$ and $\theta$, where the prior
distribution assigned to $\theta$ is the gamma pdf with parameters
$s$ and $\mu$. Let $W = Y_1 + Y_2 + \cdots + Y_n$. Find the posterior pdf
for $\theta$.

5.8.8. Find the squared-error loss Bayes estimate for $\theta$ in
Question 5.8.7.

5.8.9. Consider, again, the scenario described in Example
5.8.2: a binomial random variable $X$ has parameters $n$
and $\theta$, where the latter has a beta prior with integer
parameters $r$ and $s$. Integrate the joint pdf $p_X(k \mid \theta) f_\Theta(\theta)$ with
respect to $\theta$ to show that the marginal pdf of $X$ is given by

$$p_X(k) = \frac{\binom{k + r - 1}{k} \binom{n - k + s - 1}{n - k}}{\binom{n + r + s - 1}{n}}, \quad k = 0, 1, \ldots, n$$


5.9 Taking a Second Look at Statistics (Beyond 
Classical Estimation) 


The theory of estimation presented in this chapter can properly be called classical. 
It is a legacy of the late nineteenth and early twentieth centuries, culminating in the 
work of R.A. Fisher, especially his foundational paper published in 1922 (47). 

This chapter covers the historical, yet still vibrant, theory and technique of
estimation. This material is the basis for many of the modern advances in statistics.
And, these approaches still provide useful methods for estimating parameters and
building models.

But statistics, like every other branch of knowledge, progresses. As is the case
for most sciences, the computer has dramatically changed the landscape. Classical
problems (such as finding maximum likelihood estimators) that were difficult,
if not impossible, to solve in Fisher's day can now be attacked through computer
approximations.

However, modern computers not only give new methods for old problems,
but they also provide new avenues of approach. One such set of new methods
goes under the general name of resampling. One part of resampling is known as
bootstrapping. This technique is useful when classical inference is impossible.

A general explication of bootstrapping is not possible in this section, but an
example of its application to estimating the standard error should provide a sense of
the idea.


The standard error of an estimator $\hat{\theta}$ is just its standard deviation; that is,
$\sqrt{\mathrm{Var}(\hat{\theta})}$. The standard error, or an approximation of it, is an essential part of the
construction of confidence intervals. For the normal case, $\bar{Y}$ is the basis of the
confidence interval, and its standard error is $\sigma/\sqrt{n}$. If $X$ is a binomial random variable
with parameters $n$ and $p$, then the standard error $\sqrt{\frac{p(1-p)}{n}}$ is readily approximated
by $\sqrt{\frac{\frac{k}{n}\left(1 - \frac{k}{n}\right)}{n}}$, where $k$ is the observed number of successes.

In general, though, estimating the standard error may not be so straightforward.
As a case in point, consider the gamma pdf with $r = 2$ and unknown parameter
$\theta$, $f_Y(y; \theta) = \frac{1}{\theta^2} y e^{-y/\theta}$. Recall from Example 5.2.2 that the maximum likelihood
estimator for $\theta$ is $\frac{1}{2}\bar{Y}$. Then its variance is

$$\mathrm{Var}\left(\tfrac{1}{2}\bar{Y}\right) = \frac{1}{4}\mathrm{Var}(\bar{Y}) = \frac{1}{4}\frac{\mathrm{Var}(Y)}{n} = \frac{1}{4}\frac{2\theta^2}{n} = \frac{\theta^2}{2n}$$

and the standard error is the square root of the variance, or

$$\frac{\theta}{\sqrt{2n}}$$

To understand the technique of the bootstrapping estimate of the standard error
in this case, let us consider a numerical example, given in a series of steps.


Step 1. Bootstrapping begins with a random sample from the pdf of interest. If we
let $n = 15$, Table 5.9.1 is the given sample from $f_Y(y; \theta) = \frac{1}{\theta^2} y e^{-y/\theta}$.

Table 5.9.1

30.987   9.949    26.720   9.651    29.137
47.653   33.250   4.933    17.923   2.400
7.580    9.941    16.624   28.514   10.693

Step 2. The sum of the entries in the table is 285.955, so the maximum likelihood
estimate of the parameter $\theta$ is

$$\theta_e = \frac{1}{2}\bar{y} = \frac{1}{2} \cdot \frac{1}{15}(285.955) = 9.5318$$

Step 3. Then using the estimate $\theta_e = 9.5318$ for $\theta$, two hundred random samples
from the pdf $f_Y(y; 9.5318) = \frac{1}{(9.5318)^2} y e^{-y/9.5318}$ are generated. How this is done
using Minitab will be discussed in Appendix 5.A.1.
It suffices here to note that samples appear as an array of numbers with
fifteen columns and two hundred rows. Each row represents a random sample
of size 15 from the indicated gamma pdf.


Table 5.9.2 


19.445 10.867 6.183 3.517 20.388 51.501 14.735 52.809 11.244 59.533 15.135 15.579 14.354 22.670 2 
11.808 4.380 12.44 9.208 9.222 2.674 63.703 36.037 46.190 22.793 23.329 40.706 23.872 40.909 4 


(Additional 197 rows) 


7.536 4.693 7.452 22.606 11.512 2.136 2.718 25.778 16.023 27.405 18.801 65.723 0.853 7.536 4 


Step 4. Use each row of Table 5.9.2 to obtain $\theta_e = \frac{1}{2}\bar{y}$, the estimate of the unknown
parameter $\theta$. For the first row in Table 5.9.2, we obtain $\theta_e = 11.2873$, and for the
second, $\theta_e = 11.6986$.

Step 5. From Step 4, two hundred estimates of $\theta$ result. Calculate the sample
standard deviation of these two hundred numbers, which gives the value 1.83491.
This is the bootstrap estimate of the standard error.

The value of $\theta$ that generated the original sample in Table 5.9.1 was 10. Thus,
the actual standard error is

$$\frac{\theta}{\sqrt{2n}} = \frac{10}{\sqrt{2 \cdot 15}} = 1.82574$$

The bootstrap estimate of 1.83491 is quite close to the actual value.
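Steps 1 through 5 translate directly into a few lines of Python. The sketch below is not the Minitab procedure referred to in Step 3; it uses the standard library's `random.gammavariate(alpha, beta)`, whose second argument is the scale parameter, to draw from the gamma pdf with $r = 2$. The seed 2012 is an arbitrary choice, so the resulting standard error will be near, but not equal to, the 1.83491 reported above.

```python
import random
import statistics

# Step 1: the original sample (Table 5.9.1).
sample = [30.987, 9.949, 26.720, 9.651, 29.137,
          47.653, 33.250, 4.933, 17.923, 2.400,
          7.580, 9.941, 16.624, 28.514, 10.693]


def mle(ys):
    """Maximum likelihood estimate of theta for the gamma pdf with r = 2."""
    return 0.5 * statistics.mean(ys)


theta_e = mle(sample)                      # Step 2: about 9.5318

rng = random.Random(2012)                  # arbitrary seed, for repeatability
boot_estimates = []
for _ in range(200):                       # Step 3: two hundred simulated samples
    row = [rng.gammavariate(2, theta_e) for _ in range(len(sample))]
    boot_estimates.append(mle(row))        # Step 4: one estimate per row

boot_se = statistics.stdev(boot_estimates)  # Step 5: bootstrap standard error
print(round(theta_e, 4), round(boot_se, 4))
```

Rerunning with a different seed gives a sense of how much the bootstrap estimate itself varies from one simulation to the next.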


Appendix 5.A.1 Minitab Applications 


Because of their ability to generate random observations from many of the standard
probability distributions, computers can be very effective in illustrating estimation
properties and procedures. We have also seen in Sections 4.7 and 5.9 that computers
are essential tools for new estimation techniques.

The meaning of confidence intervals can also be nicely demonstrated using
Minitab's RANDOM command. Deriving formulas for confidence intervals is
straightforward, but calling attention to their variability from sample to sample is
best accomplished using a Monte Carlo analysis. Example 5.3.1 is a case in point.
The fifty simulated 95% confidence intervals displayed in Table 5.3.1 reinforce the
interpretation that should be accorded to any particular evaluation of the formula

$$\left(\bar{y} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{y} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$

The distributions of estimators—and some of their important properties—can 
also be easily examined using the computer. Recall the serial number analysis 
described in Case Study 1.2.2. If the production numbers to be estimated are large, 
then the assumption that the captured serial numbers represent a random sample 
from a discrete uniform pdf can reasonably be replaced by the assumption that the 
captured serial numbers represent a random sample from the (easier-to-work-with) 
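continuous uniform pdf, defined over the interval $[0, \theta]$. Two unbiased estimators for
$\theta$, then, would be

$$\hat{\theta}_1 = \frac{2}{n} \sum_{i=1}^{n} Y_i$$

and

$$\hat{\theta}_2 = \frac{n+1}{n} \cdot Y_{\max}$$

Question 5.4.18 gave a special case of the more general result that

$$\text{Var}(\hat{\theta}_2) = \frac{\theta^2}{n(n+2)} < \text{Var}(\hat{\theta}_1) = \frac{\theta^2}{3n}$$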


But suppose the complexity of two unbiased estimators precluded the calculation of 
their variances. How would we decide which to use? Probably the simplest solution 
would be to simulate each one’s distribution and compare their sample standard 
deviations. 

Figures 5.A.1.1 and 5.A.1.2 illustrate that technique on the two estimators

$$\hat{\theta}_1 = \frac{2}{n} \sum_{i=1}^{n} Y_i \quad \text{and} \quad \hat{\theta}_2 = \frac{n+1}{n} \cdot Y_{\max}$$

for the uniform parameter $\theta$. Suppose that $n = 5$ serial numbers have been “captured”
and the true value for $\theta$ is 3400. Figure 5.A.1.1 shows the Minitab syntax
for generating two hundred samples of size 5 from $f_Y(y; \theta) = 1/3400$, $0 \leq y \leq 3400$,
and calculating $\hat{\theta}_1$. The DESCRIBE command shows that the average of the
$\hat{\theta}_1$'s is 3383.8 and the sample standard deviation of the two hundred estimates is
913.2.

In contrast, Figure 5.A.1.2 details a similar simulation (two hundred samples,
each of size 5) for the estimator $\hat{\theta}_2$. The accompanying DESCRIBE output
lends support to the claim that $\hat{\theta}_2$ is the better estimator: the average of the
$\hat{\theta}_2$'s is closer to the true value of 3400 than the average of the $\hat{\theta}_1$'s
(3398.4 versus 3383.8), and their sample standard deviation is smaller than that
of the $\hat{\theta}_1$'s (563.9 versus 913.2).
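
For readers without Minitab, the same comparison can be sketched in Python. This is an illustrative translation, not the authors' code: the function name `simulate_estimators`, the seed, and the use of the standard-library `random` and `statistics` modules are all assumptions, and the numbers it produces will differ from the Minitab output because the random draws differ.

```python
import random
import statistics


def simulate_estimators(theta=3400.0, n=5, reps=200, seed=1):
    """Simulate both unbiased estimators of a uniform [0, theta] endpoint.

    Returns two lists of length `reps`:
      est1[i] = (2/n) * sum of the i-th sample   (twice the sample mean)
      est2[i] = ((n + 1)/n) * max of the i-th sample
    """
    rng = random.Random(seed)  # hypothetical seed, chosen only for repeatability
    est1, est2 = [], []
    for _ in range(reps):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        est1.append(2.0 * statistics.fmean(sample))
        est2.append((n + 1) / n * max(sample))
    return est1, est2


est1, est2 = simulate_estimators()
for label, est in (("theta_hat_1", est1), ("theta_hat_2", est2)):
    print(label, "mean:", round(statistics.fmean(est), 1),
          "stdev:", round(statistics.stdev(est), 1))
```

With $\theta = 3400$ and $n = 5$, the variance formulas above give standard deviations of $3400/\sqrt{15} \approx 877.9$ for $\hat{\theta}_1$ and $3400/\sqrt{35} \approx 574.7$ for $\hat{\theta}_2$, so the simulated standard deviations should land near those values.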
Figure 5.A.1.1

MTB > random 200 c1-c5;
SUBC> uniform 0 3400.
MTB > rmean c1-c5 c6
MTB > let c7 = 2*c6
MTB > histogram c7;
SUBC> start 2800;
SUBC> increment 200.
Histogram of C7   N = 200
48 Obs. below the first class
[Character histogram of C7: counts by class midpoint, 2800 through 5400 in increments of 200]

MTB > describe c7
        N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
C7    200   3383.8   3418.3   3388.6    913.2     64.6
      MIN      MAX       Q1       Q3
C7  997.0   5462.9   2718.0   4002.1


Figure 5.A.1.2

MTB > random 200 c1-c5;
SUBC> uniform 0 3400.
MTB > rmaximum c1-c5 c6
MTB > let c7 = (6/5)*c6
MTB > histogram c7;
SUBC> start 2800;
SUBC> increment 200.
Histogram of C7   N = 200
32 Obs. below the first class
[Character histogram of C7: counts by class midpoint, in increments of 200 up to 4000]

MTB > describe c7
         N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
C7     200   3398.4   3604.6   3437.1    563.9     39.9
       MIN      MAX       Q1       Q3
C7  1513.9   4077.4   3093.2   3847.9
The sample necessary for the bootstrapping example in Section 5.9 was generated
by a similar set of commands:

MTB > random 200 c1-c15;
SUBC> gamma 2 10.

Given the array in Table 5.9.2, the estimate of the parameter from each row sample
was obtained by

MTB > rmean c1-c15 c16
MTB > let c17 = .5*c16

Finally, the bootstrap estimate was the standard deviation of the numbers in Column
17 given by

MTB > stdev c17

with the resulting printout

Standard deviation of C17 = 1.83491
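
The three Minitab steps above can be mirrored in Python as a rough sketch. The function name `bootstrap_se` and the seed are hypothetical, and the sketch assumes that Minitab's `gamma 2 10` means shape 2 and scale 10, which matches the `alpha` (shape) and `beta` (scale) arguments of `random.gammavariate`. The printed estimate will not reproduce 1.83491 exactly, since the random draws differ.

```python
import random
import statistics


def bootstrap_se(shape=2.0, theta=10.0, n=15, reps=200, seed=1):
    """Standard deviation of `reps` estimates theta_hat = y_bar / shape,
    each computed from a gamma(shape, scale=theta) sample of size n."""
    rng = random.Random(seed)  # hypothetical seed, for repeatability only
    estimates = []
    for _ in range(reps):
        sample = [rng.gammavariate(shape, theta) for _ in range(n)]
        estimates.append(statistics.fmean(sample) / shape)
    return statistics.stdev(estimates)


se_boot = bootstrap_se()
se_true = 10.0 / (2 * 15) ** 0.5  # theta / sqrt(2 * 15), the actual standard error
print("bootstrap estimate:", round(se_boot, 5))
print("actual value:      ", round(se_true, 5))
```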
Testing Binomial Data—$H_0$: $p = p_0$
Type I and Type II Errors
6.6 Taking a Second Look at Statistics (Statistical Significance versus “Practical” Significance)


As a young man, Laplace went to Paris to seek his fortune as a mathematician, 
disregarding his father’s wishes that he enter the clergy. He soon became a protégé of 
d’Alembert and at the age of twenty-four was elected to the Academy of Sciences. 
Laplace was recognized as one of the leading figures of that group for his work in 
physics, celestial mechanics, and pure mathematics. He also enjoyed some political 
prestige, and his friend, Napoleon Bonaparte, made him Minister of the Interior for a 
brief period. With the restoration of the Bourbon monarchy, Laplace renounced 
Napoleon for Louis XVIII, who later made him a marquis. 

—Pierre-Simon, Marquis de Laplace (1749-1827) 


6.1 Introduction 


Inferences, as we saw in Chapter 5, often reduce to numerical estimates of param- 
eters, in the form of either single points or confidence intervals. But not always. In 
many experimental situations, the conclusion to be drawn is not numerical and is 
more aptly phrased as a choice between two conflicting theories, or hypotheses. A 
court psychiatrist, for example, may be called upon to pronounce an accused mur- 
derer either “sane” or “insane”; the FDA must decide whether a new flu vaccine is 
“effective” or “ineffective”; a geneticist concludes that the inheritance of eye color 
in a certain strain of Drosophila melanogaster either “does” or “does not” follow 
classical Mendelian principles. In this chapter we examine the statistical methodol- 
ogy and the attendant consequences involved in making decisions of this sort. 

The process of dichotomizing the possible conclusions of an experiment and 
then using the theory of probability to choose one option over the other is known 
as hypothesis testing. The two competing propositions are called the null hypothesis
(written $H_0$) and the alternative hypothesis (written $H_1$). How we go about choosing
between $H_0$ and $H_1$ is conceptually similar to the way a jury deliberates in a court
trial. The null hypothesis is analogous to the defendant: Just as the latter is presumed
innocent until “proven” guilty, so is the null hypothesis “accepted” unless the data
argue overwhelmingly to the contrary. Mathematically, choosing between $H_0$ and
$H_1$ is an exercise in applying courtroom protocol to situations where the “evidence”
consists of measurements made on random variables.

Chapter 6 focuses on basic principles—in particular, on the probabilistic struc- 
ture that underlies the decision-making process. Most of the important specific 
applications of hypothesis testing will be taken up later, beginning in Chapter 7. 


6.2 The Decision Rule 


Imagine an automobile company looking for additives that might increase gas 
mileage. As a pilot study, they send thirty cars fueled with a new additive on a road 
trip from Boston to Los Angeles. Without the additive, those same cars are known 
to average 25.0 mpg with a standard deviation ($\sigma$) of 2.4 mpg.

Suppose it turns out that the thirty cars average $\bar{y} = 26.3$ mpg with the additive.
What should the company conclude? If the additive is effective but the position 
is taken that the increase from 25.0 to 26.3 is due solely to chance, the company 
will mistakenly pass up a potentially lucrative product. On the other hand, if the 
additive is not effective but the firm interprets the mileage increase as “proof” that 
the additive works, time and money will ultimately be wasted developing a product 
that has no intrinsic value. 

In practice, researchers would assess the increase from 25.0 mpg to 26.3 mpg by 
framing the company’s choices in the context of the courtroom analogy mentioned 
in Section 6.1. Here, the null hypothesis, which is typically a statement reflecting the 
status quo, would be the assertion that the additive has no effect; the alternative 
hypothesis would claim that the additive does work. By agreement, we give $H_0$ (like
the defendant) the benefit of the doubt. If the road trip average, then, is “close” 
to 25.0 in some probabilistic sense still to be determined, we must conclude that 
the new additive has not demonstrated its superiority. The problem is that whether 
26.3 mpg qualifies as being “close” to 25.0 mpg is not immediately obvious. 

At this point, rephrasing the question in random variable terminology will prove 
helpful. Let $y_1, y_2, \ldots, y_{30}$ denote the mileages recorded by each of the cars during
the cross-country test run. We will assume that the $y_i$'s are normally distributed with
an unknown mean $\mu$. Furthermore, suppose that prior experience with road tests of
this type suggests that $\sigma$ will equal 2.4.¹ That is,

$$f_Y(y; \mu) = \frac{1}{\sqrt{2\pi}\,(2.4)}\, e^{-\frac{1}{2}\left(\frac{y - \mu}{2.4}\right)^2}, \quad -\infty < y < \infty$$


The two competing hypotheses, then, can be expressed as statements about $\mu$. In
effect, we are testing

$H_0$: $\mu = 25.0$ (Additive is not effective)
versus
$H_1$: $\mu > 25.0$ (Additive is effective)


Values of the sample mean, $\bar{y}$, less than or equal to 25.0 are certainly not grounds
for rejecting the null hypothesis; averages a bit larger than 25.0 would also lead to
that conclusion (because of the commitment to give $H_0$ the benefit of the doubt). On
the other hand, we would probably view a cross-country average of, say, 35.0 mpg as


1 In practice, the value of $\sigma$ usually needs to be estimated; we will return to that more frequently encountered
scenario in Chapter 7. 
exceptionally strong evidence against the null hypothesis, and our decision would
be “reject $H_0$.” In effect, somewhere between 25.0 and 35.0 there is a point (call it
$\bar{y}^*$) where for all practical purposes the credibility of $H_0$ ends (see Figure 6.2.1).


Figure 6.2.1 [Number line of possible sample means: values of $\bar{y}$ between 25.0 and $\bar{y}^*$ are not markedly inconsistent with the $H_0$ assertion that $\mu = 25$; values of $\bar{y}$ beyond $\bar{y}^*$ would appear to refute $H_0$.]


Finding an appropriate numerical value for $\bar{y}^*$ is accomplished by combining
the courtroom analogy with what we know about the probabilistic behavior of $\bar{Y}$.
Suppose, for the sake of argument, we set $\bar{y}^*$ equal to 25.25; that is, we would reject
$H_0$ if $\bar{y} \geq 25.25$. Is that a good decision rule? No. If 25.25 defined “close,” then $H_0$
would be rejected 28% of the time even if $H_0$ were true:


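$$
\begin{aligned}
P(\text{We reject } H_0 \mid H_0 \text{ is true}) &= P(\bar{Y} \geq 25.25 \mid \mu = 25.0) \\
&= P\left(\frac{\bar{Y} - 25.0}{2.4/\sqrt{30}} \geq \frac{25.25 - 25.0}{2.4/\sqrt{30}}\right) \\
&= P(Z \geq 0.57) \\
&= 0.2843
\end{aligned}
$$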

(see Figure 6.2.2). Common sense, though, tells us that 28% is an inappropriately 
large probability for making this kind of incorrect inference. No jury, for example, 
would convict a defendant knowing it had a 28% chance of sending an innocent 
person to jail. 


Figure 6.2.2 [Distribution of $\bar{Y}$ when $H_0$: $\mu = 25.0$ is true; the area to the right of $\bar{y}^* = 25.25$, where $H_0$ is rejected, is $P(\bar{Y} \geq \bar{y}^* \mid H_0 \text{ is true}) = 0.2843$.]


Clearly, we need to make $\bar{y}^*$ larger. Would it be reasonable to set $\bar{y}^*$ equal
to, say, 26.50? Probably not, because setting $\bar{y}^*$ that large would err in the other
direction by giving the null hypothesis too much benefit of the doubt. If $\bar{y}^* = 26.50$,
the probability of rejecting $H_0$ if $H_0$ were true is only 0.0003:


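$$
\begin{aligned}
P(\text{We reject } H_0 \mid H_0 \text{ is true}) &= P(\bar{Y} \geq 26.50 \mid \mu = 25.0) \\
&= P\left(\frac{\bar{Y} - 25.0}{2.4/\sqrt{30}} \geq \frac{26.50 - 25.0}{2.4/\sqrt{30}}\right) \\
&= P(Z \geq 3.42) \\
&= 0.0003
\end{aligned}
$$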


(see Figure 6.2.3). Requiring that much evidence before rejecting $H_0$ would be analogous
to a jury not returning a guilty verdict unless the prosecutor could produce a
roomful of eyewitnesses, an obvious motive, a signed confession, and a dead body
in the trunk of the defendant’s car!
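
The two calculations above follow the same pattern, so they can be checked with a small helper. This is a sketch under the stated assumptions of the example ($\mu = 25.0$, $\sigma = 2.4$, $n = 30$); the function name `prob_reject_when_null_true` is hypothetical, and `statistics.NormalDist` is used in place of the normal table in Appendix A.1. Because the code does not round $z$ to two decimals, its results agree with 0.2843 and 0.0003 only to about three decimal places.

```python
from statistics import NormalDist


def prob_reject_when_null_true(cutoff, mu0=25.0, sigma=2.4, n=30):
    """P(Y_bar >= cutoff) when the true mean is mu0.

    Under H0 the sample mean is normal with mean mu0 and
    standard deviation sigma / sqrt(n).
    """
    sampling_dist = NormalDist(mu=mu0, sigma=sigma / n ** 0.5)
    return 1.0 - sampling_dist.cdf(cutoff)


for cutoff in (25.25, 26.50):
    print(cutoff, round(prob_reject_when_null_true(cutoff), 4))
```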


Figure 6.2.3 [Distribution of $\bar{Y}$ when $H_0$: $\mu = 25.0$ is true; the area to the right of $\bar{y}^* = 26.50$, where $H_0$ is rejected, is $P(\bar{Y} \geq \bar{y}^* \mid H_0 \text{ is true}) = 0.0003$.]


If a probability of 0.28 represents too little benefit of the doubt being accorded
to $H_0$ and 0.0003 represents too much, what value should we choose for
$P(\bar{Y} \geq \bar{y}^* \mid H_0 \text{ is true})$? While there is no way to answer that question definitively or
mathematically, researchers who use hypothesis testing have come to the consensus that the
probability of rejecting $H_0$ when $H_0$ is true should be somewhere in the neighborhood
of 0.05. Experience seems to suggest that when a 0.05 probability is used, null
hypotheses are neither dismissed too capriciously nor embraced too wholeheartedly.
(More will be said about this particular probability, and its consequences, in
Section 6.3.)


Comment In 1768, British troops were sent to Boston to quell an outbreak of civil 
disturbances. Five citizens were killed in the aftermath, and several soldiers were 
subsequently put on trial for manslaughter. Explaining the guidelines under which a 
verdict was to be reached, the judge told the jury, “If upon the whole, ye are in any 
reasonable doubt of their guilt, ye must then, agreeable to the rule of law, declare 
them innocent” (177). Ever since, the expression “beyond all reasonable doubt” 
has been a frequently used indicator of how much evidence is needed in a jury 
trial to overturn a defendant’s presumption of innocence. For many experimenters, 
choosing y* such that 


$$P(\text{We reject } H_0 \mid H_0 \text{ is true}) = 0.05$$


is comparable to a jury convicting a defendant only if the latter’s guilt is established 
“beyond all reasonable doubt.” 


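Suppose the 0.05 “criterion” is applied here. Finding the corresponding $\bar{y}^*$ is a
calculation similar to what was done in Example 4.3.6. Given that

$$P(\bar{Y} \geq \bar{y}^* \mid H_0 \text{ is true}) = 0.05$$

it follows that

$$P\left(\frac{\bar{Y} - 25.0}{2.4/\sqrt{30}} \geq \frac{\bar{y}^* - 25.0}{2.4/\sqrt{30}}\right) = P\left(Z \geq \frac{\bar{y}^* - 25.0}{2.4/\sqrt{30}}\right) = 0.05$$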
But we know from Appendix A.1 that $P(Z \geq 1.64) = 0.05$. Therefore,

$$\frac{\bar{y}^* - 25.0}{2.4/\sqrt{30}} = 1.64 \qquad (6.2.1)$$

which implies that $\bar{y}^* = 25.718$.

The company’s statistical strategy is now completely determined: They should 
reject the null hypothesis that the additive has no effect if $\bar{y} \geq 25.718$. Since the
sample mean was 26.3, the appropriate decision is, indeed, to reject $H_0$. It appears
that the additive does increase mileage. 
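
The calculation of the cutoff can be reproduced numerically. The sketch below is illustrative only: the function name `critical_sample_mean` is hypothetical, and it uses `NormalDist().inv_cdf` rather than the table value 1.64, so it returns a cutoff of roughly 25.72 instead of exactly 25.718. The decision for $\bar{y} = 26.3$ is the same either way.

```python
from statistics import NormalDist


def critical_sample_mean(mu0, sigma, n, alpha=0.05):
    """Smallest sample mean that leads to rejecting H0: mu = mu0
    in favor of H1: mu > mu0 at level alpha (sigma known)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # upper-tail cutoff, about 1.645 for alpha = 0.05
    return mu0 + z_alpha * sigma / n ** 0.5


y_star = critical_sample_mean(mu0=25.0, sigma=2.4, n=30)
reject = 26.3 >= y_star  # observed sample mean from the road trip
print("cutoff:", round(y_star, 3), "reject H0:", reject)
```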


Comment It must be remembered that rejecting $H_0$ does not prove that $H_0$ is false,
any more than a jury’s decision to convict guarantees that the defendant is guilty.
The 0.05 decision rule is simply saying that if the true mean ($\mu$) is 25.0, sample
means ($\bar{y}$) as large or larger than 25.718 are expected to occur only 5% of the time.
Because of that small probability, a reasonable conclusion when $\bar{y} \geq 25.718$ is that
$\mu$ is not 25.0.

Table 6.2.1 is a computer simulation of this particular 0.05 decision rule. A total
of seventy-five random samples, each of size 30, have been drawn from a normal
distribution having $\mu = 25.0$ and $\sigma = 2.4$. The corresponding $\bar{y}$ for each sample is
then compared with $\bar{y}^* = 25.718$. As the entries in the table indicate, five of the
samples lead to the erroneous conclusion that $H_0$: $\mu = 25.0$ should be rejected.

Since each sample mean has a 0.05 probability of exceeding 25.718 (when
$\mu = 25.0$), we would expect 75(0.05), or 3.75, of the data sets to result in a “reject


Table 6.2.1
$\bar{y}$   $\geq 25.718$?   $\bar{y}$   $\geq 25.718$?   $\bar{y}$   $\geq 25.718$?


25.133 no 25.259 no 25.200 no 
24.602 no 25.866 yes 25.653 no 
24.587 no 25.623 no 25.198 no 
24.945 no 24.550 no 24.758 no 
24.761 no 24.919 no 24.842 no 
24.177 no 24.770 no 25.383 no 
25.306 no 25.080 no 24.793 no 
25.601 no 25.307 no 24.874 no 
24.121 no 24.004 no 25.513 no 
25.516 no 24.772 no 24.862 no 
24.547 no 24.843 no 25.034 no 
24.235 no 29171 yes 25.150 no 
25.809 yes 24.233 no 24.639 no 
25.719 yes 24.853 no 24.314 no 
25.307 no 25.018 no 25.045 no 
25.011 no 25.176 no 24.803 no 
24.783 no 24.750 no 24.780 no 
25.196 no 25.578 no 25.691 no 
24.577 no 24.807 no 24.207 no 
24.762 no 24.298 no 24.743 no 
25.805 yes 24.807 no 24.618 no 
24.380 no 24.346 no 25.401 no 
25.224 no 25.261 no 24.958 no 
24.371 no 25.062 no 25.678 no 
25.033 no 25.391 no 24.795 no 
$H_0$” conclusion. Reassuringly, the observed number of incorrect inferences (= 5) is
quite close to that expected value.
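
A simulation in the spirit of Table 6.2.1 can be sketched as follows. The function name `count_false_rejections` and the seed are hypothetical, and the count obtained will generally differ from the five rejections in the table because the random samples differ; over many repetitions the rejection rate should settle near 0.05.

```python
import random
import statistics


def count_false_rejections(num_samples=75, n=30, mu0=25.0, sigma=2.4,
                           cutoff=25.718, seed=1):
    """Draw `num_samples` normal samples of size n with H0 true (mean mu0)
    and count how many sample means fall at or above the cutoff."""
    rng = random.Random(seed)  # hypothetical seed, for repeatability only
    rejections = 0
    for _ in range(num_samples):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        if statistics.fmean(sample) >= cutoff:
            rejections += 1
    return rejections


print("rejections out of 75:", count_false_rejections())
print("expected number:     ", 75 * 0.05)
```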


Definition 6.2.1. If $H_0$: $\mu = \mu_0$ is rejected using a 0.05 decision rule, the
difference between $\bar{y}$ and $\mu_0$ is said to be statistically significant.


Expressing Decision Rules in Terms of Z Ratios 


As we have seen, decision rules are statements that spell out the conditions under 
which a null hypothesis is to be rejected. The format of those statements, though, 
can vary. Depending on the context, one version may be easier to work with than 
another. 

Recall Equation 6.2.1. Rejecting $H_0$: $\mu = 25.0$ when

$$\bar{y} \geq \bar{y}^* = 25.0 + 1.64 \cdot \frac{2.4}{\sqrt{30}} = 25.718$$

is clearly equivalent to rejecting $H_0$ when

$$\frac{\bar{y} - 25.0}{2.4/\sqrt{30}} \geq 1.64 \qquad (6.2.2)$$

(if one rejects the null hypothesis, the other will necessarily do the same). 


We know from Chapter 4 that the random variable $\frac{\bar{Y} - 25.0}{2.4/\sqrt{30}}$ has a standard normal
distribution (if $\mu = 25.0$). When a particular $\bar{y}$ is substituted for $\bar{Y}$ (as in Inequality
6.2.2), we call $\frac{\bar{y} - 25.0}{2.4/\sqrt{30}}$ the observed $z$. Choosing between $H_0$ and $H_1$ is typically (and
most conveniently) done in terms of the observed $z$. In Section 6.4, though, we will
encounter certain questions related to hypothesis testing that are best answered by
phrasing the decision rule in terms of $\bar{y}^*$.


Definition 6.2.2. Any function of the observed data whose numerical value
dictates whether $H_0$ is accepted or rejected is called a test statistic. The set of
values for the test statistic that result in the null hypothesis being rejected is
called the critical region and is denoted $C$. The particular point in $C$ that separates
the rejection region from the acceptance region is called the critical value.


Comment For the gas mileage example, both $\bar{y}$ and $\frac{\bar{y} - 25.0}{2.4/\sqrt{30}}$ qualify as test statistics.
If the sample mean is used, the associated critical region would be written

$$C = \{\bar{y} : \bar{y} \geq 25.718\}$$

(and 25.718 is the critical value). If the decision rule is framed in terms of a $Z$ ratio,

$$C = \left\{z : z = \frac{\bar{y} - 25.0}{2.4/\sqrt{30}} \geq 1.64\right\}$$

In this latter case, the critical value is 1.64.


Definition 6.2.3. The probability that the test statistic lies in the critical region
when $H_0$ is true is called the level of significance and is denoted $\alpha$.
Comment In principle, the value chosen for $\alpha$ should reflect the consequences of
making the mistake of rejecting $H_0$ when $H_0$ is true. As those consequences get more
severe, the critical region $C$ should be defined so that $\alpha$ gets smaller. In practice,
though, efforts to quantify the costs of making incorrect inferences are arbitrary at
best. In most situations, experimenters abandon any such attempts and routinely set
the level of significance equal to 0.05. If another $\alpha$ is used, it is likely to be either
0.001, 0.01, or 0.10.

Here again, the similarity between hypothesis testing and courtroom protocol is 
worth keeping in mind. Just as experimenters can make $\alpha$ larger or smaller to reflect
the consequences of mistakenly rejecting $H_0$ when $H_0$ is true, so can juries demand
more or less evidence to return a conviction. For juries, any such changes are usually 
dictated by the severity of the possible punishment. A grand jury deciding whether 
or not to indict someone for fraud, for example, will inevitably require less evidence 
to return a conviction than will a jury impaneled for a murder trial. 


One-Sided Versus Two-Sided Alternatives 


In most hypothesis tests, $H_0$ consists of a single number, typically the value of the
parameter that represents the status quo. The “25.0” in $H_0$: $\mu = 25.0$, for example, is
the mileage that would be expected when the additive has no effect. If the mean of a
normal distribution is the parameter being tested, our general notation for the null
hypothesis will be $H_0$: $\mu = \mu_0$, where $\mu_0$ is the status quo value of $\mu$.

Alternative hypotheses, by way of contrast, invariably embrace entire ranges of 
parameter values. If there is reason to believe before any data are collected that the 
parameter being tested is necessarily restricted to one particular “side” of $H_0$, then
$H_1$ is defined to reflect that limitation and we say that the alternative hypothesis is
one-sided. Two variations are possible: $H_1$ can be one-sided to the left ($H_1$: $\mu < \mu_0$)
or it can be one-sided to the right ($H_1$: $\mu > \mu_0$). If no such a priori information is
available, the alternative hypothesis needs to accommodate the possibility that the
true parameter value might lie on either side of $\mu_0$. Any such alternative is said to
be two-sided. For testing $H_0$: $\mu = \mu_0$, the two-sided alternative is written $H_1$: $\mu \neq \mu_0$.

In the gasoline example, it was tacitly assumed that the additive either would 
have no effect (in which case $\mu = 25.0$ and $H_0$ would be true) or would increase
mileage (implying that the true mean would lie somewhere “to the right” of $H_0$).
Accordingly, we wrote the alternative hypothesis as $H_1$: $\mu > 25.0$. If we had reason
to suspect, though, that the additive might interfere with the gasoline’s combustibility
and possibly decrease mileage, it would have been necessary to use a two-sided
alternative ($H_1$: $\mu \neq 25.0$).

Whether the alternative hypothesis is defined to be one-sided or two-sided is 
important because the nature of $H_1$ plays a key role in determining the form of the
critical region. We saw earlier that the 0.05 decision rule for testing

$H_0$: $\mu = 25.0$
versus
$H_1$: $\mu > 25.0$

calls for $H_0$ to be rejected if $\frac{\bar{y} - 25.0}{2.4/\sqrt{30}} \geq 1.64$. That is, only if the sample mean is
substantially larger than 25.0 will we reject $H_0$.

If the alternative hypothesis had been two-sided, sample means either much
smaller than 25.0 or much larger than 25.0 would be evidence against $H_0$ (and in
support of $H_1$). Moreover, the 0.05 probability associated with the critical region
$C$ would be split into two halves, with 0.025 being assigned to the left-most portion
of $C$, and 0.025 to the right-most portion. From Appendix Table A.1, though,
$P(Z \leq -1.96) = P(Z \geq 1.96) = 0.025$, so the two-sided 0.05 decision rule would call
for $H_0$: $\mu = 25.0$ to be rejected if $\frac{\bar{y} - 25.0}{2.4/\sqrt{30}}$ is either (1) $\leq -1.96$ or (2) $\geq 1.96$.


Testing $H_0$: $\mu = \mu_0$ ($\sigma$ Known)


Let $z_\alpha$ be the number having the property that $P(Z \geq z_\alpha) = \alpha$. Values for $z_\alpha$ can
be found from the standard normal cdf tabulated in Appendix A.1. If $\alpha = 0.05$, for
example, $z_{.05} = 1.64$ (see Figure 6.2.4). Of course, by the symmetry of the normal
curve, $-z_\alpha$ has the property that $P(Z \leq -z_\alpha) = \alpha$.

Figure 6.2.4 [Standard normal pdf with $z_{.05} = 1.64$ marked on the horizontal axis.]


Theorem 6.2.1

Let $y_1, y_2, \ldots, y_n$ be a random sample of size $n$ from a normal distribution where $\sigma$ is
known. Let $z = \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}}$.

a. To test $H_0$: $\mu = \mu_0$ versus $H_1$: $\mu > \mu_0$ at the $\alpha$ level of significance, reject $H_0$ if $z \geq z_\alpha$.
b. To test $H_0$: $\mu = \mu_0$ versus $H_1$: $\mu < \mu_0$ at the $\alpha$ level of significance, reject $H_0$ if $z \leq -z_\alpha$.
c. To test $H_0$: $\mu = \mu_0$ versus $H_1$: $\mu \neq \mu_0$ at the $\alpha$ level of significance, reject $H_0$ if $z$ is
either (1) $\leq -z_{\alpha/2}$ or (2) $\geq z_{\alpha/2}$.


Example 6.2.1


As part of a “Math for the Twenty-First Century” initiative, Bayview High was cho- 
sen to participate in the evaluation of a new algebra and geometry curriculum. In 
the recent past, Bayview’s students were considered “typical,” having earned scores 
on standardized exams that were very consistent with national averages. 

Two years ago, a cohort of eighty-six Bayview sophomores, all randomly 
selected, were assigned to a special set of classes that integrated algebra and geom- 
etry. According to test results that have just been released, those students averaged 
502 on the SAT-I math exam; nationwide, seniors averaged 494 with a standard deviation
of 124. Can it be claimed at the $\alpha = 0.05$ level of significance that the new
curriculum had an effect? 

To begin, we define the parameter $\mu$ to be the true average SAT-I math score
that we could expect the new curriculum to produce. The obvious “status quo”
value for $\mu$ is the current national average; that is, $\mu_0 = 494$. The alternative
hypothesis here should be two-sided because the possibility certainly exists that a
revised curriculum, however well intentioned, would actually lower a student’s
achievement.

According to part (c) of Theorem 6.2.1, then, we should reject $H_0$: $\mu = 494$ in
favor of $H_1$: $\mu \neq 494$ at the $\alpha = 0.05$ level of significance if the test statistic $z$ is either
(1) $\leq -z_{.025}$ $(= -1.96)$ or (2) $\geq z_{.025}$ $(= 1.96)$. But $\bar{y} = 502$, so
$$z = \frac{502 - 494}{124/\sqrt{86}} = 0.60$$

implying that our decision should be “Fail to reject $H_0$.” Even though Bayview’s
502 is eight points above the national average, it does not follow that the improvement
was due to the new curriculum: An increase of that magnitude could easily
have occurred by chance, even if the new curriculum had no effect whatsoever (see
Figure 6.2.5).

Figure 6.2.5
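
The three cases of Theorem 6.2.1 can be collected into one small function. This is a sketch rather than a standard library routine: the name `z_test` and the labels `"greater"`, `"less"`, and `"two-sided"` are assumptions made for illustration, and critical values come from `NormalDist().inv_cdf` instead of the table (1.645 and 1.960 rather than 1.64 and 1.96).

```python
from statistics import NormalDist


def z_test(y_bar, mu0, sigma, n, alpha=0.05, alternative="two-sided"):
    """Return (observed z, reject H0?) for H0: mu = mu0 with sigma known."""
    z = (y_bar - mu0) / (sigma / n ** 0.5)
    std_normal = NormalDist()
    if alternative == "greater":        # H1: mu > mu0, part (a)
        reject = z >= std_normal.inv_cdf(1.0 - alpha)
    elif alternative == "less":         # H1: mu < mu0, part (b)
        reject = z <= -std_normal.inv_cdf(1.0 - alpha)
    elif alternative == "two-sided":    # H1: mu != mu0, part (c)
        reject = abs(z) >= std_normal.inv_cdf(1.0 - alpha / 2.0)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, reject


# Bayview High data from Example 6.2.1
z, reject = z_test(y_bar=502, mu0=494, sigma=124, n=86)
print("observed z:", round(z, 2), "reject H0:", reject)
```

Applied to the gas mileage data with `alternative="greater"`, the same function rejects $H_0$, in agreement with the earlier decision based on $\bar{y}^*$.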


Comment If the null hypothesis is not rejected, we should phrase the conclusion 
as “Fail to reject $H_0$” rather than “Accept $H_0$.” Those two statements may seem to
be the same, but, in fact, they have very different connotations. The phrase “Accept
$H_0$” suggests that the experimenter is concluding that $H_0$ is true. But that may not
be the case. In a court trial, when a jury returns a verdict of “Not guilty,” they are 
not saying that they necessarily believe that the defendant is innocent. They are 
simply asserting that the evidence—in their opinion—is not sufficient to overturn 
the presumption that the defendant is innocent. That same distinction applies to 
hypothesis testing. If a test statistic does not fall in the critical region (which was 
the case in Example 6.2.1), the proper interpretation is to conclude that we “Fail to 
reject $H_0$.”


The P-Value 


There are two general ways to quantify the amount of evidence against $H_0$ that is
contained in a given set of data. The first involves the level of significance concept
introduced in Definition 6.2.3. Using that format, the experimenter selects a value
for $\alpha$ (usually 0.05 or 0.01) before any data are collected. Once $\alpha$ is specified, a
corresponding critical region can be identified. If the test statistic falls in the critical
region, we reject $H_0$ at the $\alpha$ level of significance. Another strategy is to calculate a
P-value. 


Definition 6.2.4. The P-value associated with an observed test statistic is the 
probability of getting a value for that test statistic as extreme as or more 
extreme than what was actually observed (relative to $H_1$) given that $H_0$ is
true.
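
For a $Z$ ratio, Definition 6.2.4 translates directly into tail areas of the standard normal curve. The sketch below is illustrative: the function name `p_value` and the `alternative` labels are hypothetical, and the observed $z = 0.60$ is the rounded value from Example 6.2.1. Which tail counts as "more extreme" depends on the form of $H_1$.

```python
from statistics import NormalDist


def p_value(z, alternative="two-sided"):
    """P-value for an observed standard normal test statistic z."""
    std_normal = NormalDist()
    if alternative == "greater":      # H1: mu > mu0, right tail only
        return 1.0 - std_normal.cdf(z)
    if alternative == "less":         # H1: mu < mu0, left tail only
        return std_normal.cdf(z)
    if alternative == "two-sided":    # H1: mu != mu0, both tails
        return 2.0 * (1.0 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")


print("two-sided:", round(p_value(0.60), 4))
print("one-sided:", round(p_value(0.60, "greater"), 4))
```

The two-sided result may differ from a table-based answer in the fourth decimal place, because the table rounds each tail area before the two are added.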


Comment Test statistics that yield small P-values should be interpreted as evidence
against $H_0$. More specifically, if the P-value calculated for a test statistic is less than
or equal to $\alpha$, the null hypothesis can be rejected at the $\alpha$ level of significance. Or,
put another way, the P-value is the smallest $\alpha$ at which we can reject $H_0$.


Example 6.2.2

Recall Example 6.2.1. Given that $H_0$: $\mu = 494$ is being tested against $H_1$: $\mu \neq 494$,
what P-value is associated with the calculated test statistic, $z = 0.60$, and how should
it be interpreted?

If $H_0$: $\mu = 494$ is true, the random variable $Z = \frac{\bar{Y} - 494}{124/\sqrt{86}}$ has a standard normal
pdf. Relative to the two-sided $H_1$, any value of $Z$ greater than or equal to 0.60 or
less than or equal to $-0.60$ qualifies as being “as extreme as or more extreme than”
the observed $z$. Therefore, by Definition 6.2.4,

$$
\begin{aligned}
\text{P-value} &= P(Z \geq 0.60) + P(Z \leq -0.60) \\
&= 0.2743 + 0.2743 \\
&= 0.5486
\end{aligned}
$$

(see Figure 6.2.6).

Figure 6.2.6 [Standard normal pdf with both tails beyond $\pm 0.60$ shaded; P-value = 0.2743 + 0.2743.]

As noted in the preceding comment, P-values can be used as decision rules. In
Example 6.2.1, 0.05 was the stated level of significance. Having determined here that
the P-value associated with $z = 0.60$ is 0.5486, we know that $H_0$: $\mu = 494$ would not
be rejected at the given $\alpha$. Indeed, the null hypothesis would not be rejected for any
value of $\alpha$ less than 0.5486.

Notice that the P-value would have been halved had $H_1$ been one-sided. Suppose
we were confident that the new algebra and geometry classes would not
lower a student’s math SAT. The appropriate hypothesis test in that case would be
$H_0$: $\mu = 494$ versus $H_1$: $\mu > 494$. Moreover, only values in the right-hand tail of $f_Z(z)$
would be considered more extreme than the observed $z = 0.60$, so

$$\text{P-value} = P(Z \geq 0.60) = 0.2743$$


Questions


6.2.1. State the decision rule that would be used to test
the following hypotheses. Evaluate the appropriate test
statistic and state your conclusion.

(a) $H_0$: $\mu = 120$ versus $H_1$: $\mu < 120$; $\bar{y} = 114.2$, $n = 25$, $\sigma = 18$, $\alpha = 0.08$

(b) $H_0$: $\mu = 42.9$ versus $H_1$: $\mu \neq 42.9$; $\bar{y} = 45.1$, $n = 16$, $\sigma = 3.2$, $\alpha = 0.01$

(c) $H_0$: $\mu = 14.2$ versus $H_1$: $\mu > 14.2$; $\bar{y} = 15.8$, $n = 9$, $\sigma = 4.1$, $\alpha = 0.13$


6.2.2. An herbalist is experimenting with juices extracted 
from berries and roots that may have the ability to affect 
the Stanford-Binet IQ scores of students afflicted with 
mild cases of attention deficit disorder (ADD). A ran- 
dom sample of twenty-two children diagnosed with the 
condition have been drinking Brain-Blaster daily for two 
months. Past experience suggests that children with ADD 
score an average of 95 on the IQ test with a standard deviation
of 15. If the data are to be analyzed using the $\alpha = 0.06$
level of significance, what values of $\bar{y}$ would cause $H_0$ to be
rejected? Assume that $H_1$ is two-sided.


6.2.3. (a) Suppose $H_0$: $\mu = \mu_0$ is rejected in favor of
$H_1$: $\mu \neq \mu_0$ at the $\alpha = 0.05$ level of significance. Would
$H_0$ necessarily be rejected at the $\alpha = 0.01$ level of
significance?

(b) Suppose $H_0$: $\mu = \mu_0$ is rejected in favor of $H_1$: $\mu \neq \mu_0$
at the $\alpha = 0.01$ level of significance. Would $H_0$ necessarily
be rejected at the $\alpha = 0.05$ level of significance?


6.2.4. Company records show that drivers get an aver- 
age of 32,500 miles on a set of Road Hugger All-Weather 
radial tires. Hoping to improve that figure, the company 
has added a new polymer to the rubber that should help 
protect the tires from deterioration caused by extreme 
temperatures. Fifteen drivers who tested the new tires 
have reported getting an average of 33,800 miles. Can the 
company claim that the polymer has produced a statistically
significant increase in tire mileage? Test $H_0$: $\mu = 32{,}500$
against a one-sided alternative at the $\alpha = 0.05$
level. Assume that the standard deviation ($\sigma$) of the tire
mileages has not been affected by the addition of the 
polymer and is still 4000 miles. 


6.2.5. If $H_0$: $\mu = \mu_0$ is rejected in favor of $H_1$: $\mu > \mu_0$, will
it necessarily be rejected in favor of $H_1$: $\mu \neq \mu_0$? Assume
that $\alpha$ remains the same.


6.2.6. A random sample of size 16 is drawn from a normal
distribution having $\sigma = 6.0$ for the purpose of testing
$H_0$: $\mu = 30$ versus $H_1$: $\mu \neq 30$. The experimenter chooses
to define the critical region $C$ to be the set of sample
means lying in the interval (29.9, 30.1). What level
of significance does the test have? Why is (29.9, 30.1)
a poor choice for the critical region? What range of $\bar{y}$
values should comprise $C$, assuming the same $\alpha$ is to
be used?


6.2.7. Recall the breath analyzers described in Exam- 
ple 4.3.5. The following are thirty blood alcohol deter- 
minations made by Analyzer GTE-10, a three-year-old 


unit that may be in need of recalibration. All thirty 
measurements were made using a test sample on which a 
properly adjusted machine would give a reading of 12.6%. 


12.3   12.7   13.6   12.7   12.9   12.6
12.6   13.1   12.6   13.1   12.7   12.5
13.2   12.8   12.4   12.6   12.4   12.4
13.1   12.9   13.3   12.6   12.6   12.7
13.1   12.4   12.4   13.1   12.4   12.9


(a) If $\mu$ denotes the true average reading that Analyzer
GTE-10 would give for a person whose blood
alcohol concentration is 12.6%, test

$H_0$: $\mu = 12.6$
versus
$H_1$: $\mu \neq 12.6$

at the $\alpha = 0.05$ level of significance. Assume that
$\sigma = 0.4$. Would you recommend that the machine be
readjusted?

(b) What statistical assumptions are implicit in the
hypothesis test done in part (a)? Is there any reason
to suspect that those assumptions may not be
satisfied?


6.2.8. Calculate the P-values for the hypothesis tests indicated
in Question 6.2.1. Do they agree with your decisions
on whether or not to reject $H_0$?


6.2.9. Suppose $H_0\colon \mu = 120$ is tested against $H_1\colon \mu \neq 120$.
If $\sigma = 10$ and $n = 16$, what P-value is associated with the
sample mean $\bar{y} = 122.3$? Under what circumstances would
$H_0$ be rejected?


6.2.10. As a class research project, Rosaura wants to see
whether the stress of final exams elevates the blood pressures
of freshmen women. When they are not under any
untoward duress, healthy eighteen-year-old women have
systolic blood pressures that average 120 mm Hg with a
standard deviation of 12 mm Hg. If Rosaura finds that the
average blood pressure for the fifty women in Statistics 
101 on the day of the final exam is 125.2, what should she 
conclude? Set up and test an appropriate hypothesis. 


6.2.11. As input for a new inflation model, economists 
predicted that the average cost of a hypothetical “food 
basket” in east Tennessee in July would be $145.75. The 
standard deviation (σ) of basket prices was assumed to be
$9.50, a figure that has held fairly constant over the years.
To check their prediction, a sample of twenty-five baskets
representing different parts of the region were checked in
late July, and the average cost was $149.75. Let α = 0.05. Is
the difference between the economists’ prediction and the 
sample mean statistically significant? 


6.3 Testing Binomial Data—$H_0\colon p = p_0$


Suppose a set of data, $k_1, k_2, \ldots, k_n$, represents the outcomes of $n$ Bernoulli trials,
where $k_i = 1$ or $0$, depending on whether the $i$th trial ended in success or failure,
respectively. If $p = P(i\text{th trial ends in success})$ is unknown, it may be appropriate to
test the null hypothesis $H_0\colon p = p_0$, where $p_0$ is some particularly relevant (or status
quo) value of $p$. Any such procedure is called a binomial hypothesis test because
the appropriate test statistic is the sum of the $k_i$'s (call it $k$) and we know from
Theorem 3.2.1 that the total number of successes, $X$, in a series of independent trials
has a binomial distribution,

$$p_X(k; p) = P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, 2, \ldots, n$$


Two different procedures for testing $H_0\colon p = p_0$ need to be considered, the
distinction resting on the magnitude of $n$. If

$$0 < np_0 - 3\sqrt{np_0(1 - p_0)} < np_0 + 3\sqrt{np_0(1 - p_0)} < n \tag{6.3.1}$$

a “large-sample” test of $H_0\colon p = p_0$ is done, based on an approximate $Z$ ratio.
Otherwise, a “small-sample” decision rule is used, one where the critical region
is defined in terms of the exact binomial distribution associated with the random
variable $X$.
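Deciding which of the two procedures applies is a matter of arithmetic, and a short Python sketch may make the check concrete. The function name `large_sample_ok` and the two (n, p_0) pairs are illustrative choices, not part of the text; the body simply evaluates Inequality 6.3.1.

```python
import math


def large_sample_ok(n: int, p0: float) -> bool:
    """Return True when Inequality 6.3.1 holds for n trials and null value p0."""
    spread = 3 * math.sqrt(n * p0 * (1 - p0))
    # Both ends of the interval np0 +/- 3*sqrt(np0(1 - p0)) must lie strictly inside (0, n).
    return 0 < n * p0 - spread and n * p0 + spread < n


# Illustrative inputs: a large balanced sample and a small sample with p0 near 1.
print(large_sample_ok(124, 0.50))
print(large_sample_ok(19, 0.85))
```

When the function returns False, the normal approximation is not recommended and the exact binomial distribution should define the critical region.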


A Large-Sample Test for the Binomial Parameter p 


Suppose the number of observations, $n$, making up a set of Bernoulli random variables
is sufficiently large that Inequality 6.3.1 is satisfied. We know in that case from
Section 4.3 that the random variable $\frac{X - np_0}{\sqrt{np_0(1 - p_0)}}$ has approximately a standard normal
pdf, $f_Z(z)$, if $p = p_0$. Values of $\frac{X - np_0}{\sqrt{np_0(1 - p_0)}}$ close to zero, of course, would be evidence in
favor of $H_0\colon p = p_0$ [since $E\left(\frac{X - np_0}{\sqrt{np_0(1 - p_0)}}\right) = 0$ when $p = p_0$]. Conversely, the credibility
of $H_0\colon p = p_0$ clearly diminishes as $\frac{X - np_0}{\sqrt{np_0(1 - p_0)}}$ moves further and further away from
zero. The large-sample test of $H_0\colon p = p_0$, then, takes the same basic form as the test
of $H_0\colon \mu = \mu_0$ in Section 6.2.

Theorem 6.3.1. Let $k_1, k_2, \ldots, k_n$ be a random sample of $n$ Bernoulli random variables for which
$0 < np_0 - 3\sqrt{np_0(1 - p_0)} < np_0 + 3\sqrt{np_0(1 - p_0)} < n$. Let $k = k_1 + k_2 + \cdots + k_n$ denote
the total number of “successes” in the $n$ trials. Define $z = \frac{k - np_0}{\sqrt{np_0(1 - p_0)}}$.

a. To test $H_0\colon p = p_0$ versus $H_1\colon p > p_0$ at the $\alpha$ level of significance, reject $H_0$ if
$z \ge z_\alpha$.

b. To test $H_0\colon p = p_0$ versus $H_1\colon p < p_0$ at the $\alpha$ level of significance, reject $H_0$ if
$z \le -z_\alpha$.

c. To test $H_0\colon p = p_0$ versus $H_1\colon p \neq p_0$ at the $\alpha$ level of significance, reject $H_0$ if $z$ is
either (1) $\le -z_{\alpha/2}$ or (2) $\ge z_{\alpha/2}$.
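The three decision rules can be collected into one small routine. The following is a minimal sketch using only Python's standard library; the function name `binomial_z_test`, the `alternative` labels, and the counts in the usage lines are hypothetical and chosen only for illustration. Critical values come from `statistics.NormalDist`, so they are slightly more precise than the two-decimal table values (for example, 1.645 rather than 1.64).

```python
import math
from statistics import NormalDist


def binomial_z_test(k, n, p0, alpha=0.05, alternative="two-sided"):
    """Large-sample test of H0: p = p0. Returns (z, reject_H0)."""
    z = (k - n * p0) / math.sqrt(n * p0 * (1 - p0))
    std_normal = NormalDist()
    if alternative == "greater":        # H1: p > p0
        reject = z >= std_normal.inv_cdf(1 - alpha)
    elif alternative == "less":         # H1: p < p0
        reject = z <= -std_normal.inv_cdf(1 - alpha)
    elif alternative == "two-sided":    # H1: p != p0
        reject = abs(z) >= std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, reject


# Hypothetical data: 60 successes in 100 trials, testing p0 = 0.5.
z, reject = binomial_z_test(60, 100, 0.5)
print(z, reject)
```

The routine assumes Inequality 6.3.1 has already been verified; it does not check the condition itself.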
Case Study 6.3.1


In gambling parlance, a point spread is a hypothetical increment added to the 
score of the presumably weaker of two teams playing. By intention, its magni- 
tude should have the effect of making the game a toss-up; that is, each team 
should have a 50% chance of beating the spread. 

In practice, setting the “line” on a game is a highly subjective endeavor, 
which raises the question of whether or not the Las Vegas crowd actually gets 
it right (113). Addressing that issue, a recent study examined the records of 
124 National Football League games; it was found that in sixty-seven of the 
matchups (or 54%), the favored team beat the spread. Is the difference between 
54% and 50% small enough to be written off to chance, or did the study uncover 
convincing evidence that oddsmakers are not capable of accurately quantifying 
the competitive edge that one team holds over another? 

Let p = P(Favored team beats spread). If p is any value other than 0.50, 
the bookies are assigning point spreads incorrectly. To be tested, then, are the 
hypotheses 


$H_0\colon p = 0.50$
versus
$H_1\colon p \neq 0.50$

Suppose 0.05 is taken to be the level of significance.
In the terminology of Theorem 6.3.1, $n = 124$, $p_0 = 0.50$, and

$$k_i = \begin{cases} 1 & \text{if favored team beats spread in $i$th game} \\ 0 & \text{if favored team does not beat spread in $i$th game} \end{cases}$$

for $i = 1, 2, \ldots, 124$. Therefore, the sum $k = k_1 + k_2 + \cdots + k_{124}$ denotes the total
number of times the favored team beat the spread.

According to the two-sided decision rule given in part (c) of Theorem 6.3.1,
the null hypothesis should be rejected if $z$ is either less than or equal to
$-1.96$ ($= -z_{.05/2}$) or greater than or equal to $1.96$ ($= z_{.05/2}$). But

$$z = \frac{67 - 124(0.50)}{\sqrt{124(0.50)(0.50)}} = 0.90$$

does not fall in the critical region, so $H_0\colon p = 0.50$ should not be rejected at the
$\alpha = 0.05$ level of significance. The outcomes of these 124 games, in other words,
are entirely consistent with the presumption that bookies know which of two
teams is better, and by how much.


About the Data  Here the observed $z$ is 0.90 and $H_1$ is two-sided, so the P-value
is 0.37:

$$\text{P-value} = P(Z \le -0.90) + P(Z \ge 0.90) = 0.1841 + 0.1841 = 0.37$$
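The same two-tail area can be obtained without a table. The sketch below assumes only Python's standard library; the helper name `two_sided_p_value` is an illustrative choice.

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """Area in both tails of the standard normal beyond |z|."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_value = two_sided_p_value(0.90)
print(round(p_value, 2))
```

Because the standard normal pdf is symmetric, doubling one tail is equivalent to adding the two tail areas separately.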


According to the Comment following Definition 6.2.4, then, the conclusion could be 
written 


“Fail to reject $H_0$ for any $\alpha < 0.37$.”


Would it also be correct to summarize the data with the statement

“Reject $H_0$ at the $\alpha = 0.40$ level of significance”?


In theory, yes; in practice, no. For all the reasons discussed in Section 6.2, the
rationale underlying hypothesis testing demands that $\alpha$ be kept small (and “small”
usually means less than or equal to 0.10).

It is typically the experimenter’s objective to reject $H_0$, because $H_0$ represents
the status quo, and there is seldom a compelling reason to devote time and money to
a study for the purpose of confirming what is already believed. That being the case,
experimenters are always on the lookout for ways to increase their probability of
rejecting $H_0$. There are a number of entirely appropriate actions that can be taken to
accomplish that objective, several of which will be discussed in Section 6.4. However,
raising $\alpha$ above 0.10 is not one of the appropriate actions; and raising $\alpha$ as high as
0.40 would absolutely never be done.


Case Study 6.3.2 


There is a theory that people may tend to “postpone” their deaths until after 
some event that has particular meaning to them has passed (134). Birthdays, 
a family reunion, or the return of a loved one have all been suggested as the 
sorts of personal milestones that might have such an effect. National elections 
may be another. Studies have shown that the mortality rate in the United States 
drops noticeably during the Septembers and Octobers of presidential election 
years. If the postponement theory is to be believed, the reason for the decrease 
is that many of the elderly who would have died in those two months “hang on” 
until they see who wins. 

Some years ago, a national periodical reported the findings of a study 
that looked at obituaries published in a Salt Lake City newspaper. Among 
the 747 decedents, the paper identified that only 60, or 8.0%, had died in the 
three-month period preceding their birth months (123). If individuals are dying 
randomly with respect to their birthdays, we would expect 25% to die during 
any given three-month interval. What should we make, then, of the decrease 
from 25% to 8%? Has the study provided convincing evidence that the death 
months reported for the sample do not constitute a random sample of months? 

Imagine the 747 deaths being divided into two categories: those that
occurred in the three-month period prior to a person’s birthday and those that
occurred at other times during the year. Let $k_i = 1$ if the $i$th person belongs to
the first category and $k_i = 0$, otherwise. Then $k = k_1 + k_2 + \cdots + k_{747}$ denotes the
total number of deaths in the first category. The latter, of course, is the value of
a binomial random variable with parameter $p$, where

$p = P(\text{Person dies in three months prior to birth month})$

If people do not postpone their deaths (to wait for a birthday), $p$ should be
$\frac{3}{12}$, or 0.25; if they do, $p$ will be something less than 0.25. Assessing the decrease
from 25% to 8%, then, is done with a one-sided binomial hypothesis test:

$H_0\colon p = 0.25$
versus
$H_1\colon p < 0.25$



Let $\alpha = 0.05$. According to part (b) of Theorem 6.3.1, $H_0$ should be rejected
if

$$z = \frac{k - np_0}{\sqrt{np_0(1 - p_0)}} \le -z_{.05} = -1.64$$

Substituting for $k$, $n$, and $p_0$, we find that the test statistic falls far to the left of
the critical value:

$$z = \frac{60 - 747(0.25)}{\sqrt{747(0.25)(0.75)}} = -10.7$$

The evidence is overwhelming, therefore, that the decrease from 25% 
to 8% is due to something other than chance. Explanations other than the 
postponement theory, of course, may be wholly or partially responsible for 
the nonrandom distribution of deaths. Still, the data show a pattern entirely 
consistent with the notion that we do have some control over when we die. 


About the Data  A similar conclusion was reached in a study conducted among
the Chinese community living in California. The “significant event” in that case 
was not a birthday—it was the annual Harvest Moon festival, a celebration that 
holds particular meaning for elderly women. Based on census data tracked over a 
twenty-four-year period, it was determined that fifty-one deaths among elderly Chi- 
nese women should have occurred during the week before the festivals, and fifty-two 
deaths after the festivals. In point of fact, thirty-three died the week before and 
seventy died the week after (22). 


A Small-Sample Test for the Binomial Parameter p 


Suppose that $k_1, k_2, \ldots, k_n$ is a random sample of Bernoulli random variables where
$n$ is too small for Inequality 6.3.1 to hold. The decision rule, then, for testing
$H_0\colon p = p_0$ that was given in Theorem 6.3.1 would not be appropriate. Instead, the
critical region is defined by using the exact binomial distribution (rather than a
normal approximation).


Example 6.3.1. Suppose that $n = 19$ elderly patients are to be given an experimental drug designed
to relieve arthritis pain. The standard treatment is known to be effective in 85% of
similar cases. If $p$ denotes the probability that the new drug will reduce a patient’s
pain, the researcher wishes to test

$H_0\colon p = 0.85$
versus
$H_1\colon p \neq 0.85$

The decision will be based on the magnitude of $k$, the total number in the sample for
whom the drug is effective, that is, on

$$k = k_1 + k_2 + \cdots + k_{19}$$
where

$$k_i = \begin{cases} 0 & \text{if the new drug fails to relieve $i$th patient's pain} \\ 1 & \text{if the new drug does relieve $i$th patient's pain} \end{cases}$$

What should the decision rule be if the intention is to keep $\alpha$ somewhere near
10%? [Note that Theorem 6.3.1 does not apply here because Inequality 6.3.1 is not
satisfied; specifically, $np_0 + 3\sqrt{np_0(1 - p_0)} = 19(0.85) + 3\sqrt{19(0.85)(0.15)} = 20.8$ is
not less than $n$ (= 19).]

If the null hypothesis is true, the expected number of successes would be $np_0 =
19(0.85)$, or 16.2. It follows that values of $k$ to the extreme right or extreme left of
16.2 should constitute the critical region.


MTB > pdf;
SUBC > binomial 19 0.85.

Probability Density Function

Binomial with n = 19 and p = 0.85

 x    P(X = x)
 6    0.000000
 7    0.000002
 8    0.000018
 9    0.000123
10    0.000699
11    0.003242
12    0.012246
13    0.037366
14    0.090746
15    0.171409
16    0.242829
17    0.242829
18    0.152892
19    0.045599

P(X ≤ 13) = 0.053696
P(X = 19) = 0.045599

Figure 6.3.1 
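The probabilities in the printout, and the two tail totals noted with it, can be reproduced with a few lines of Python. This is a sketch using only the standard library; the helper names `binomial_pmf` and `region_probability` are illustrative and not part of the text.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial random variable with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p0 = 19, 0.85
pmf = [binomial_pmf(k, n, p0) for k in range(n + 1)]


def region_probability(region) -> float:
    """Total probability, under H0, of the k-values in a candidate critical region."""
    return sum(pmf[k] for k in region)


lower_tail = region_probability(range(0, 14))  # k <= 13
upper_tail = region_probability([19])          # k = 19
print(f"{lower_tail:.6f}  {upper_tail:.6f}  {lower_tail + upper_tail:.4f}")
```

Trying other ranges in `region_probability` shows how coarse the attainable levels of significance are when n is small: moving a single boundary value in or out of the region changes the total by a whole pmf term.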


Figure 6.3.1 is a Minitab printout of $p_X(k) = \binom{19}{k}(0.85)^k(0.15)^{19-k}$. By inspection,
we can see that the critical region

$$C = \{k\colon k \le 13 \text{ or } k = 19\}$$


would produce an $\alpha$ close to the desired 0.10 (and would keep the probabilities
associated with the two sides of the rejection region roughly the same). In random
variable notation,

$$\begin{aligned}
P(X \in C \mid H_0 \text{ is true}) &= P(X \le 13 \mid p = 0.85) + P(X = 19 \mid p = 0.85) \\
&= 0.053696 + 0.045599 \\
&= 0.099295 \\
&\approx 0.10
\end{aligned}$$


Questions


6.3.1. Commercial fishermen working certain parts of
the Atlantic Ocean sometimes find their efforts hindered
by the presence of whales. Ideally, they would like to
scare away the whales without frightening the fish. One
of the strategies being experimented with is to transmit
underwater the sounds of a killer whale. On the fifty-two
occasions that technique has been tried, it worked
twenty-four times (that is, the whales immediately left
the area). Experience has shown, though, that 40% of
all whales sighted near fishing boats leave of their own
accord, probably just to get away from the noise of the
boat.

(a) Let $p = P(\text{Whale leaves area after hearing sounds of killer whale})$.
Test $H_0\colon p = 0.40$ versus $H_1\colon p > 0.40$ at
the $\alpha = 0.05$ level of significance. Can it be argued on
the basis of these data that transmitting underwater
predator sounds is an effective technique for clearing
fishing waters of unwanted whales?

(b) Calculate the P-value for these data. For what values
of $\alpha$ would $H_0$ be rejected?


6.3.2. Efforts to find a genetic explanation for why 
certain people are right-handed and others left-handed 
have been largely unsuccessful. Reliable data are diffi- 
cult to find because of environmental factors that also 
influence a child’s “handedness.” To avoid that compli- 
cation, researchers often study the analogous problem 
of “pawedness” in animals, where both genotypes and 
the environment can be partially controlled. In one such 
experiment (27), mice were put into a cage having a feed- 
ing tube that was equally accessible from the right or 
the left. Each mouse was then carefully watched over a 
number of feedings. If it used its right paw more than 
half the time to activate the tube, it was defined to be 
“right-pawed.” Observations of this sort showed that 67% 
of mice belonging to strain A/J are right-pawed. A simi- 
lar protocol was followed on a sample of thirty-five mice 
belonging to strain A/HeJ. Of those thirty-five, a total of 
eighteen were eventually classified as right-pawed. Test 
whether the proportion of right-pawed mice found in the 
A/HeJ sample was significantly different from what was 
known about the A/J strain. Use a two-sided alternative 
and let 0.05 be the probability associated with the critical 
region. 


6.3.3. Defeated in his most recent attempt to win a con- 
gressional seat because of a sizeable gender gap, a politi- 
cian has spent the last two years speaking out in favor 
of women’s rights issues. A newly released poll claims to 
have contacted a random sample of 120 of the politician’s 
current supporters and found that 72 were men. In the 
election that he lost, exit polls indicated that 65% of those 
who voted for him were men. Using an $\alpha = 0.05$ level of
significance, test the null hypothesis that the proportion 
of his male supporters has remained the same. Make the 
alternative hypothesis one-sided. 


6.3.4. Suppose $H_0\colon p = 0.45$ is to be tested against $H_1\colon p >
0.45$ at the $\alpha = 0.14$ level of significance, where $p = P(i\text{th
trial ends in success})$. If the sample size is 200, what is
the smallest number of successes that will cause $H_0$ to be
rejected?


6.3.5. Recall the median test described in Example 5.3.2. 
Reformulate that analysis as a hypothesis test rather than 
a confidence interval. What P-value is associated with the 
outcomes listed in Table 5.3.3? 


6.3.6. Among the early attempts to revisit the death post- 
ponement theory introduced in Case Study 6.3.2 was an 
examination of the birth dates and death dates of 348 U.S.
celebrities (134). It was found that 16 of those individuals
had died in the month preceding their birth month. Set up
and test the appropriate $H_0$ against a one-sided $H_1$. Use
the 0.05 level of significance.


6.3.7. What $\alpha$ levels are possible with a decision rule of
the form “Reject $H_0$ if $k \ge k^*$” when $H_0\colon p = 0.5$ is to be
tested against $H_1\colon p > 0.5$ using a random sample of size
$n = 7$?


6.3.8. The following is a Minitab printout
of the binomial pdf $p_X(k) = \binom{9}{k}(0.6)^k(0.4)^{9-k}$,
$k = 0, 1, \ldots, 9$. Suppose $H_0\colon p = 0.6$ is to be tested against
$H_1\colon p > 0.6$ and we wish the level of significance to be
exactly 0.05. Use Theorem 2.4.1 to combine two different
critical regions into a single randomized decision rule for
which $\alpha = 0.05$.


MTB > pdf;
SUBC > binomial 9 0.6.

Probability Density Function

Binomial with n = 9 and p = 0.6

x    P(X = x)
0    0.000262
1    0.003539
2    0.021234
3    0.074318
4    0.167215
5    0.250823
6    0.250823
7    0.161243
8    0.060466
9    0.010078


6.3.9. Suppose $H_0\colon p = 0.75$ is to be tested against $H_1\colon p <
0.75$ using a random sample of size $n = 7$ and the decision
rule “Reject $H_0$ if $k \le 3$.”


(a) What is the test’s level of significance? 
(b) Graph the probability that Hp will be rejected as a 
function of p. 


6.4 Type I and Type II Errors


The possibility of drawing incorrect conclusions is an inevitable byproduct of 
hypothesis testing. No matter what sort of mathematical facade is laid atop the 
decision-making process, there is no way to guarantee that what the test tells us 
is the truth. One kind of error (rejecting $H_0$ when $H_0$ is true) figured prominently
in Section 6.3: It was argued that critical regions should be defined so as to keep the 
probability of making such errors small, often on the order of 0.05. 


In point of fact, there are two different kinds of errors that can be committed
with any hypothesis test: (1) We can reject $H_0$ when $H_0$ is true and (2) we can fail to
reject $H_0$ when $H_0$ is false. These are called Type I and Type II errors, respectively.
At the same time, there are two kinds of correct decisions: (1) We can fail to reject
a true $H_0$ and (2) we can reject a false $H_0$. Figure 6.4.1 shows these four possible
“Decision/State of nature” combinations.


                                True State of Nature

                            H_0 is true        H_1 is true

Our        Fail to          Correct            Type II
Decision   reject H_0       decision           error

           Reject H_0       Type I             Correct
                            error              decision

Figure 6.4.1


Computing the Probability of Committing a Type I Error


Once an inference is made, there is no way to know whether the conclusion reached
was correct. It is possible, though, to calculate the probability of having made an
error, and the magnitude of that probability can help us better understand the
“power” of the hypothesis test and its ability to distinguish between $H_0$ and $H_1$.

Recall the fuel additive example developed in Section 6.2: $H_0\colon \mu = 25.0$ was to
be tested against $H_1\colon \mu > 25.0$ using a sample of size $n = 30$. The decision rule stated
that $H_0$ should be rejected if $\bar{y}$, the average mpg with the new additive, equalled or
exceeded 25.718. In that case, the probability of committing a Type I error is 0.05:

$$\begin{aligned}
P(\text{Type I error}) &= P(\text{Reject } H_0 \mid H_0 \text{ is true}) \\
&= P(\bar{Y} \ge 25.718 \mid \mu = 25.0) \\
&= P\left(\frac{\bar{Y} - 25.0}{2.4/\sqrt{30}} \ge \frac{25.718 - 25.0}{2.4/\sqrt{30}}\right) \\
&= P(Z \ge 1.64) = 0.05
\end{aligned}$$


Of course, the fact that the probability of committing a Type I error equals 0.05
should come as no surprise. In our earlier discussion of how “beyond reasonable
doubt” should be interpreted numerically, we specifically chose the critical region so
that the probability of the decision rule rejecting $H_0$ when $H_0$ is true would be 0.05.

In general, the probability of committing a Type I error is referred to as a test’s
level of significance and is denoted $\alpha$ (recall Definition 6.2.3). The concept is a crucial
one: The level of significance is a single-number summary of the “rules” by which the
decision process is being conducted. In essence, $\alpha$ reflects the amount of evidence
the experimenter is demanding to see before abandoning the null hypothesis.


Computing the Probability of Committing a Type II Error 


We just saw that calculating the probability of a Type I error is a nonproblem:
There are no computations necessary, since the probability equals whatever value
the experimenter sets a priori for $\alpha$. A similar situation does not hold for Type
II errors. To begin with, Type II error probabilities are not specified explicitly by
the experimenter; also, each hypothesis test has an infinite number of Type II error
probabilities, one for each value of the parameter admissible under $H_1$.

As an example, suppose we want to find the probability of committing a Type
II error in the gasoline experiment if the true $\mu$ (with the additive) were 25.750. By
definition,

$$\begin{aligned}
P(\text{Type II error} \mid \mu = 25.750) &= P(\text{We fail to reject } H_0 \mid \mu = 25.750) \\
&= P(\bar{Y} < 25.718 \mid \mu = 25.750) \\
&= P\left(\frac{\bar{Y} - 25.75}{2.4/\sqrt{30}} < \frac{25.718 - 25.75}{2.4/\sqrt{30}}\right) \\
&= P(Z < -0.07) = 0.4721
\end{aligned}$$


So, even if the new additive increased the fuel economy to 25.750 mpg (from
25 mpg), our decision rule would be “tricked” 47% of the time: that is, it would
tell us on those occasions not to reject $H_0$.
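Both error probabilities for the gasoline experiment can be checked numerically. The sketch below uses only Python's standard library; the function name `type_ii_error` and its default arguments are illustrative, taken from the decision rule "reject if the sample mean is at least 25.718" with a standard deviation of 2.4 and n = 30. Because the text rounds each z-value to two decimals before consulting a table, the computed values agree with the printed ones only to about two or three decimal places.

```python
import math
from statistics import NormalDist


def type_ii_error(mu_true, crit=25.718, sigma=2.4, n=30):
    """P(sample mean < crit | mu = mu_true): probability of failing to reject H0."""
    sampling_dist = NormalDist(mu_true, sigma / math.sqrt(n))
    return sampling_dist.cdf(crit)


# Type I error: probability of landing in the critical region when mu = 25.0.
alpha = 1 - NormalDist(25.0, 2.4 / math.sqrt(30)).cdf(25.718)
# Type II error when the true mean is 25.75.
beta = type_ii_error(25.75)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
```

Calling `type_ii_error` with other values of the true mean gives the whole family of Type II error probabilities, one for each mean admissible under the alternative.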

The symbol for the probability of committing a Type II error is $\beta$. Figure 6.4.2
shows the sampling distribution of $\bar{Y}$ when $\mu = 25.0$ (i.e., when $H_0$ is true) and when
$\mu = 25.750$ ($H_1$ is true); the areas corresponding to $\alpha$ and $\beta$ are shaded.


[Figure 6.4.2: sampling distributions of $\bar{Y}$ when $H_0$ is true and when $\mu = 25.75$; the critical value 25.718 separates “Accept $H_0$” from “Reject $H_0$.”]

[Figure 6.4.3: sampling distributions of $\bar{Y}$ when $H_0$ is true and when $\mu = 26.8$.]

Clearly, the magnitude of $\beta$ is a function of the presumed value for $\mu$. If, for
example, the gasoline additive is so effective as to raise fuel efficiency to 26.8 mpg,
the probability that our decision rule would lead us to make a Type II error is a much
smaller 0.0068:

$$\begin{aligned}
P(\text{Type II error} \mid \mu = 26.8) &= P(\text{We fail to reject } H_0 \mid \mu = 26.8) \\
&= P(\bar{Y} < 25.718 \mid \mu = 26.8) = P\left(\frac{\bar{Y} - 26.8}{2.4/\sqrt{30}} < \frac{25.718 - 26.8}{2.4/\sqrt{30}}\right) \\
&= P(Z < -2.47) = 0.0068
\end{aligned}$$

(See Figure 6.4.3.)


Power Curves 


If $\beta$ is the probability that we fail to reject $H_0$ when $H_1$ is true, then $1 - \beta$ is the
probability of the complement, that we reject $H_0$ when $H_1$ is true. We call $1 - \beta$
the power of the test; it represents the ability of the decision rule to “recognize”
(correctly) that $H_0$ is false.

The alternative hypothesis $H_1$ usually depends on a parameter, which makes
$1 - \beta$ a function of that parameter. The relationship they share can be pictured by
drawing a power curve, which is simply a graph of $1 - \beta$ versus the set of all possible
parameter values.

Figure 6.4.4 shows the power curve for testing

$H_0\colon \mu = 25.0$
versus
$H_1\colon \mu > 25.0$

where $\mu$ is the mean of a normal distribution with $\sigma = 2.4$, and the decision rule is
“Reject $H_0$ if $\bar{y} \ge 25.718$.” The two marked points on the curve represent the
($\mu$, $1 - \beta$) pairs just determined, (25.75, 0.5279) and (26.8, 0.9932). One other point can
be gotten for every power curve, without doing any calculations: When $\mu = \mu_0$ (the
value specified by $H_0$), $1 - \beta = \alpha$. Of course, as the true mean gets further and further
away from the $H_0$ mean, the power will converge to 1.
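A power curve is easy to tabulate once the decision rule is fixed. The following sketch (standard-library Python; the function name `power` and the grid of means are illustrative choices) evaluates the power of the rule "reject if the sample mean is at least 25.718" at several presumed values of the true mean. Values read from a printed graph are necessarily approximate, so small differences from graph-based figures should be expected.

```python
import math
from statistics import NormalDist


def power(mu, crit=25.718, sigma=2.4, n=30):
    """1 - beta: probability that the sample mean reaches the critical value when the true mean is mu."""
    return 1 - NormalDist(mu, sigma / math.sqrt(n)).cdf(crit)


# Tabulate the curve at a few presumed values of mu.
for mu in (25.0, 25.5, 25.75, 26.0, 26.8):
    print(f"mu = {mu:5.2f}   power = {power(mu):.3f}")
```

At the null value of the mean the function returns the level of significance (about 0.05), and the values increase toward 1 as the mean moves away from 25.0, which are the two general features of a power curve noted above.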


[Figure 6.4.4: power curve, $1 - \beta$ (0 to 1.0) versus the presumed value for $\mu$ (25.00 to 27.00); arrows mark Power = 0.72 and Power = 0.29.]


Power curves serve two different purposes. On the one hand, they completely
characterize the performance that can be expected from a hypothesis test. In
Figure 6.4.4, for example, the two arrows show that the probability of rejecting
$H_0\colon \mu = 25$ in favor of $H_1\colon \mu > 25$ when $\mu = 26.0$ is approximately 0.72. (Or, equivalently,
Type II errors will be committed roughly 28% of the time when $\mu = 26.0$.)
As the true mean moves closer to $\mu_0$ (and becomes more difficult to distinguish)
the power of the test understandably diminishes. If $\mu = 25.5$, for example, the graph
shows that $1 - \beta$ falls to 0.29.

[Figure 6.4.5: power curves for Method A and Method B.]

Power curves are also useful for comparing one inference procedure with 
another. For every conceivable hypothesis testing situation, a variety of procedures 
for choosing between $H_0$ and $H_1$ will be available. How do we know which to use?

The answer to that question is not always simple. Some procedures will be 
computationally more convenient or easier to explain than others; some will make 
slightly different assumptions about the pdf being sampled. Associated with each 
of them, though, is a power curve. If the selection of a hypothesis test is to hinge 
solely on its ability to distinguish $H_0$ from $H_1$, then the procedure to choose is the
one having the steepest power curve. 

Figure 6.4.5 shows the power curves for two hypothetical methods A and B, 
each of which is testing $H_0\colon \theta = \theta_0$ versus $H_1\colon \theta \neq \theta_0$ at the $\alpha$ level of significance.
From the standpoint of power, Method B is clearly the better of the two: it always
has a higher probability of correctly rejecting $H_0$ when the parameter $\theta$ is not equal
to $\theta_0$.


Factors That Influence the Power of a Test 


The ability of a test procedure to reject $H_0$ when $H_0$ is false is clearly of prime
importance, a fact that raises an obvious question: What can an experimenter do
to influence the value of $1 - \beta$? In the case of the $Z$ test described in Theorem 6.2.1,
$1 - \beta$ is a function of $\alpha$, $\sigma$, and $n$. By appropriately raising or lowering the values of
those parameters, the power of the test against any given $\mu$ can be made to equal
any desired level.


The Effect of $\alpha$ on $1 - \beta$


Consider again the test of

$H_0\colon \mu = 25.0$
versus
$H_1\colon \mu > 25.0$

discussed earlier in this section. In its original form, $\alpha = 0.05$, $\sigma = 2.4$, $n = 30$, and the
decision rule called for $H_0$ to be rejected if $\bar{y} \ge 25.718$.

Figure 6.4.6 shows what happens to $1 - \beta$ (when $\mu = 25.75$) if $\sigma$, $n$, and $\mu$ are
held constant but $\alpha$ is increased to 0.10. The top pair of distributions shows the
configuration that appears in Figure 6.4.2; the power in this case is $1 - 0.4721$, or
0.53. The bottom portion of the graph illustrates what happens when $\alpha$ is set at 0.10
instead of 0.05: the decision rule changes from “Reject $H_0$ if $\bar{y} \ge 25.718$” to “Reject
$H_0$ if $\bar{y} \ge 25.561$” (see Question 6.4.2) and the power increases from 0.53 to 0.67:

$$\begin{aligned}
1 - \beta &= P(\text{Reject } H_0 \mid H_1 \text{ is true}) \\
&= P(\bar{Y} \ge 25.561 \mid \mu = 25.75) \\
&= P\left(\frac{\bar{Y} - 25.75}{2.4/\sqrt{30}} \ge \frac{25.561 - 25.75}{2.4/\sqrt{30}}\right) \\
&= P(Z \ge -0.43) \\
&= 0.6664
\end{aligned}$$
[Figure 6.4.6: two pairs of sampling distributions of $\bar{Y}$, one when $H_0$ is true and one when $\mu = 25.75$. Top panel: critical value 25.718, power = 0.53. Bottom panel: critical value 25.561, power = 0.67.]


The specifics of Figure 6.4.6 accurately reflect what is true in general: Increasing
$\alpha$ decreases $\beta$ and increases the power. That said, it does not follow in practice that
experimenters should manipulate $\alpha$ to achieve a desired $1 - \beta$. For all the reasons
cited in Section 6.2, $\alpha$ should typically be set equal to a number somewhere in the
neighborhood of 0.05. If the corresponding $1 - \beta$ against a particular $\mu$ is deemed to
be inappropriate, adjustments should be made in the values of $\sigma$ and/or $n$.
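The trade-off between the level of significance and power can be verified directly. In the sketch below (standard-library Python; the function name `critical_value_and_power` is an illustrative choice), the critical value is recomputed from the level of significance using exact normal quantiles, so the results differ in the third decimal place from figures based on the rounded table values 1.64 and 1.28.

```python
import math
from statistics import NormalDist


def critical_value_and_power(alpha, mu_true, mu0=25.0, sigma=2.4, n=30):
    """Critical value of the one-sided Z test and its power against mu_true."""
    se = sigma / math.sqrt(n)
    crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    power = 1 - NormalDist(mu_true, se).cdf(crit)
    return crit, power


crit_05, power_05 = critical_value_and_power(0.05, 25.75)
crit_10, power_10 = critical_value_and_power(0.10, 25.75)
print(f"alpha = 0.05: reject if mean >= {crit_05:.3f}, power = {power_05:.2f}")
print(f"alpha = 0.10: reject if mean >= {crit_10:.3f}, power = {power_10:.2f}")
```

Raising the level of significance moves the critical value toward the null mean, which enlarges the rejection region and therefore the power, at the cost of a larger Type I error probability.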
The Effects of $\sigma$ and $n$ on $1 - \beta$


Although it may not always be feasible (or even possible), decreasing $\sigma$ will necessarily
increase $1 - \beta$. In the gasoline additive example, $\sigma$ is assumed to be 2.4 mpg,
the latter being a measure of the variation in gas mileages from driver to driver
achieved in a cross-country road trip from Boston to Los Angeles (recall p. 351).
Intuitively, the environmental differences inherent in a trip of that magnitude would
be considerable. Different drivers would encounter different weather conditions and
varying amounts of traffic, and would perhaps take alternate routes.

Suppose, instead, that the drivers simply did laps around a test track rather than
drive on actual highways. Conditions from driver to driver would then be much more
uniform and the value of $\sigma$ would surely be smaller. What would be the effect on
$1 - \beta$ when $\mu = 25.75$ (and $\alpha = 0.05$) if $\sigma$ could be reduced from 2.4 mpg to 1.2 mpg?

As Figure 6.4.7 shows, reducing $\sigma$ has the effect of making the $H_0$ distribution
of $\bar{Y}$ more concentrated around $\mu_0$ (= 25) and the $H_1$ distribution of $\bar{Y}$ more concentrated
around $\mu$ (= 25.75). Substituting into Equation 6.2.1 (with 1.2 for $\sigma$ in place
of 2.4), we find that the critical value $\bar{y}^*$ moves closer to $\mu_0$ [from 25.718 to 25.359
$\left(= 25 + 1.64 \cdot \frac{1.2}{\sqrt{30}}\right)$] and the proportion of the $H_1$ distribution above the rejection
region (i.e., the power) increases from 0.53 to 0.96:

$$1 - \beta = P(\bar{Y} \ge 25.359 \mid \mu = 25.75) = P\left(Z \ge \frac{25.359 - 25.75}{1.2/\sqrt{30}}\right) = P(Z \ge -1.78) = 0.9625$$

[Figure 6.4.7: sampling distributions of $\bar{Y}$ when $H_0$ is true and when $\mu = 25.75$. Top panel (when $\sigma = 2.4$): critical value 25.718, power = 0.53. Bottom panel (when $\sigma = 1.2$): power = 0.96.]

In theory, reducing $\sigma$ can be a very effective way of increasing the power of
a test, as Figure 6.4.7 makes abundantly clear. In practice, though, refinements in
the way data are collected that would have a substantial impact on the magnitude
of $\sigma$ are often either difficult to identify or prohibitively expensive. More typically,
experimenters achieve the same effect by simply increasing the sample size.

Look again at the two sets of distributions in Figure 6.4.7. The increase in $1 - \beta$
from 0.53 to 0.96 was accomplished by cutting the denominator of the test statistic
$\left(z = \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}}\right)$ in half by reducing the standard deviation from 2.4 to 1.2. The same
numerical effect would be produced if $\sigma$ were left unchanged but $n$ was increased
from 30 to 120, that is, $\frac{1.2}{\sqrt{30}} = \frac{2.4}{\sqrt{120}}$. Because it can easily be increased or decreased,
the sample size is the parameter that researchers almost invariably turn to as the
mechanism for ensuring that a hypothesis test will have a sufficiently high power
against a given alternative.
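The equivalence between halving the standard deviation and quadrupling the sample size can be confirmed numerically. The sketch below uses only Python's standard library; the function name `power_of_z_test` is an illustrative choice, and exact normal quantiles are used in place of the rounded table value 1.64.

```python
import math
from statistics import NormalDist


def power_of_z_test(mu_true, sigma, n, mu0=25.0, alpha=0.05):
    """Power of the one-sided Z test of H0: mu = mu0 against the alternative mu_true > mu0."""
    se = sigma / math.sqrt(n)  # denominator of the test statistic
    crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    return 1 - NormalDist(mu_true, se).cdf(crit)


baseline = power_of_z_test(25.75, sigma=2.4, n=30)
half_sigma = power_of_z_test(25.75, sigma=1.2, n=30)
four_times_n = power_of_z_test(25.75, sigma=2.4, n=120)
print(f"{baseline:.2f}  {half_sigma:.2f}  {four_times_n:.2f}")
```

Only the ratio of the standard deviation to the square root of n enters the calculation, which is why the last two configurations give the same power.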


Example 6.4.1

Suppose an experimenter wishes to test

$$H_0: \mu = 100$$
versus
$$H_1: \mu > 100$$

at the $\alpha = 0.05$ level of significance and wants $1 - \beta$ to equal 0.60 when $\mu = 103$. What is the smallest (i.e., cheapest) sample size that will achieve that objective? Assume that the variable being measured is normally distributed with $\sigma = 14$.

Finding $n$, given values for $\alpha$, $1 - \beta$, $\sigma$, and $\mu$, requires that two simultaneous equations be written for the critical value $\bar{y}^*$, one in terms of the $H_0$ distribution and the other in terms of the $H_1$ distribution. Setting the two equal will yield the minimum sample size that achieves the desired $\alpha$ and $1 - \beta$.

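Consider, first, the consequences of the level of significance being equal to 0.05. By definition,

$$\begin{aligned}
\alpha &= P(\text{We reject } H_0 \mid H_0 \text{ is true}) \\
&= P(\bar{Y} \geq \bar{y}^* \mid \mu = 100) \\
&= P\left(\frac{\bar{Y} - 100}{14/\sqrt{n}} \geq \frac{\bar{y}^* - 100}{14/\sqrt{n}}\right) \\
&= P\left(Z \geq \frac{\bar{y}^* - 100}{14/\sqrt{n}}\right) \\
&= 0.05
\end{aligned}$$

But $P(Z \geq 1.64) = 0.05$, so

$$\frac{\bar{y}^* - 100}{14/\sqrt{n}} = 1.64$$

or, equivalently,

$$\bar{y}^* = 100 + 1.64 \cdot \frac{14}{\sqrt{n}} \qquad (6.4.1)$$

Similarly,

$$1 - \beta = P(\text{We reject } H_0 \mid H_1 \text{ is true}) = P(\bar{Y} \geq \bar{y}^* \mid \mu = 103) = P\left(\frac{\bar{Y} - 103}{14/\sqrt{n}} \geq \frac{\bar{y}^* - 103}{14/\sqrt{n}}\right) = 0.60$$

From Appendix Table A.1, though, $P(Z \geq -0.25) = 0.5987 \approx 0.60$, so

$$\frac{\bar{y}^* - 103}{14/\sqrt{n}} = -0.25$$

which implies that

$$\bar{y}^* = 103 - 0.25 \cdot \frac{14}{\sqrt{n}} \qquad (6.4.2)$$

It follows, then, from Equations 6.4.1 and 6.4.2 that

$$100 + 1.64 \cdot \frac{14}{\sqrt{n}} = 103 - 0.25 \cdot \frac{14}{\sqrt{n}}$$

so that $1.89 \cdot \frac{14}{\sqrt{n}} = 3$, $\sqrt{n} = 8.82$, and $n = 77.8$. Solving for $n$ thus shows that a minimum of seventy-eight observations must be taken to guarantee that the hypothesis test will have the desired precision.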
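Equating the two expressions for $\bar{y}^*$ gives $\sqrt{n} = (z_\alpha + z_{1-\beta})\,\sigma/(\mu_1 - \mu_0)$, which is easy to evaluate directly. The sketch below is illustrative only; `min_sample_size` is a hypothetical helper name, and the first call uses the rounded table values 1.64 and 0.25 from the example.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(mu0, mu1, sigma, z_alpha, z_power):
    """Smallest n for which mu0 + z_alpha*sigma/sqrt(n) <= mu1 - z_power*sigma/sqrt(n)."""
    root_n = (z_alpha + z_power) * sigma / (mu1 - mu0)
    return ceil(root_n ** 2)


# Rounded table values, as in the worked example.
n_table = min_sample_size(100, 103, 14, z_alpha=1.64, z_power=0.25)

# Unrounded normal quantiles (an assumption beyond the text): the answer can
# shift by one observation because 77.8 is close to an integer boundary.
Z = NormalDist()
n_exact = min_sample_size(100, 103, 14,
                          z_alpha=Z.inv_cdf(0.95), z_power=Z.inv_cdf(0.60))

print(n_table, n_exact)
```

Rounding up is the natural choice here, since any smaller $n$ leaves the power below the target.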


Decision Rules for Nonnormal Data 


Our discussion of hypothesis testing thus far has been confined to inferences involv- 
ing either binomial data or normal data. Decision rules for other types of probability 
functions are rooted in the same basic principles. 

In general, to test $H_0$: $\theta = \theta_0$, where $\theta$ is the unknown parameter in a pdf $f_Y(y; \theta)$, we initially define the decision rule in terms of $\hat{\theta}$, where the latter is a sufficient statistic for $\theta$. The corresponding critical region is the set of values of $\hat{\theta}$ least compatible with $\theta_0$ (but admissible under $H_1$) whose total probability when $H_0$ is true is $\alpha$. In the case of testing $H_0$: $\mu = \mu_0$ versus $H_1$: $\mu > \mu_0$, for example, where the data are normally distributed, $\bar{Y}$ is a sufficient statistic for $\mu$, and the least likely values for the sample mean that are admissible under $H_1$ are those for which $\bar{y} \geq \bar{y}^*$, where $P(\bar{Y} \geq \bar{y}^* \mid H_0 \text{ is true}) = \alpha$.


Example 6.4.2

A random sample of size $n = 8$ is drawn from the uniform pdf, $f_Y(y; \theta) = 1/\theta$, $0 \leq y \leq \theta$, for the purpose of testing

$$H_0: \theta = 2.0$$
versus
$$H_1: \theta < 2.0$$

at the $\alpha = 0.10$ level of significance. Suppose the decision rule is to be based on $Y_8'$, the largest order statistic. What would be the probability of committing a Type II error when $\theta = 1.7$?

If $H_0$ is true, $Y_8'$ should be close to 2.0, and values of the largest order statistic that are much smaller than 2.0 would be evidence in favor of $H_1$: $\theta < 2.0$. It follows, then, that the form of the decision rule should be

"Reject $H_0$: $\theta = 2.0$ if $y_8' \leq c$"

where $P(Y_8' \leq c \mid H_0 \text{ is true}) = 0.10$.

From Theorem 3.10.1,

$$f_{Y_8'}(y; \theta = 2) = 8\left(\frac{y}{2}\right)^7 \cdot \frac{1}{2}, \quad 0 \leq y \leq 2$$

Therefore, the constant $c$ that appears in the $\alpha = 0.10$ decision rule must satisfy the equation

$$\int_0^c 8\left(\frac{y}{2}\right)^7 \cdot \frac{1}{2}\, dy = 0.10$$

or, equivalently,

$$\left(\frac{c}{2}\right)^8 = 0.10$$

implying that $c = 1.50$.

Now, $\beta$ when $\theta = 1.7$ is, by definition, the probability that $Y_8'$ falls in the acceptance region when $H_1$: $\theta = 1.7$ is true. That is,

$$\beta = P(Y_8' > 1.50 \mid \theta = 1.7) = \int_{1.50}^{1.7} 8\left(\frac{y}{1.7}\right)^7 \cdot \frac{1}{1.7}\, dy = 1 - \left(\frac{1.5}{1.7}\right)^8 = 1 - 0.37 = 0.63$$

(See Figure 6.4.8.)

[Figure 6.4.8: pdfs of $Y_8'$ when $H_1$: $\theta = 1.7$ is true and when $H_0$: $\theta = 2.0$ is true, with the areas $\alpha = 0.10$ and $\beta = 0.63$ and the "Reject $H_0$" region marked; vertical axis: probability density.]
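Both integrals in Example 6.4.2 reduce to the cdf of the largest order statistic, $P(Y_n' \leq y) = (y/\theta)^n$, so the cutoff and the Type II error probability can be reproduced in a few lines. This is a sketch under that closed-form assumption; the function name is made up for the illustration.

```python
def largest_order_stat_cdf(y, theta, n):
    """P(largest of n uniform(0, theta) observations <= y)."""
    if y <= 0:
        return 0.0
    if y >= theta:
        return 1.0
    return (y / theta) ** n


n, theta0, alpha = 8, 2.0, 0.10

# Solve (c / theta0)^n = alpha for the cutoff c.
c = theta0 * alpha ** (1 / n)

# Type II error at theta = 1.7: the largest order statistic exceeds c.
beta = 1 - largest_order_stat_cdf(c, 1.7, n)

print(round(c, 2), round(beta, 2))
```

Using the unrounded cutoff rather than 1.50 changes $\beta$ only in the third decimal place.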
Example 6.4.3

Four measurements ($k_1, k_2, k_3, k_4$) are taken on a Poisson random variable, $X$, where $p_X(k; \lambda) = e^{-\lambda}\lambda^k/k!$, $k = 0, 1, 2, \ldots$, for the purpose of testing

$$H_0: \lambda = 0.8$$
versus
$$H_1: \lambda > 0.8$$

What decision rule should be used if the level of significance is to be 0.10, and what will be the power of the test when $\lambda = 1.2$?

From Example 5.6.1, we know that $\bar{X}$ is a sufficient statistic for $\lambda$; the same would be true, of course, for $\sum_{i=1}^{4} X_i$. It will be more convenient to state the decision rule in terms of the latter because we already know the probability model that describes its behavior: If $X_1, X_2, X_3, X_4$ are four independent Poisson random variables, each with parameter $\lambda$, then $\sum_{i=1}^{4} X_i$ has a Poisson distribution with parameter $4\lambda$ (recall Example 3.12.10).

Figure 6.4.9 is a Minitab printout of the Poisson probability function having $\lambda = 3.2$, which would be the sampling distribution of $\sum_{i=1}^{4} X_i$ when $H_0$: $\lambda = 0.8$ is true.

MTB > pdf;
SUBC > poisson 3.2.

Probability Density Function
Poisson with mean = 3.2

     x    P(X = x)
     0    0.040762
     1    0.130439
     2    0.208702
     3    0.222616
     4    0.178093
     5    0.113979
     6    0.060789
     7    0.027789
     8    0.011116
     9    0.003952
    10    0.001265
    11    0.000368
    12    0.000098
    13    0.000024
    14    0.000006
    15    0.000001
    16    0.000000

Critical region: x = 6 through 16; α = P(Reject H0 | H0 is true) = 0.105408

Figure 6.4.9


MTB > pdf;
SUBC > poisson 4.8.

Probability Density Function
Poisson with mean = 4.8

     x    P(X = x)
     0    0.008230
     1    0.039503
     2    0.094807
     3    0.151691
     4    0.182029
     5    0.174748
     6    0.139798
     7    0.095862
     8    0.057517
     9    0.030676
    10    0.014724
    11    0.006425
    12    0.002570
    13    0.000949
    14    0.000325
    15    0.000104
    16    0.000031
    17    0.000009
    18    0.000002
    19    0.000001
    20    0.000000

Critical region: x = 6 through 20; 1 − β = P(Reject H0 | H1 is true) = 0.348993

Figure 6.4.10


By inspection, the decision rule "Reject $H_0$: $\lambda = 0.8$ if $\sum_{i=1}^{4} k_i \geq 6$" gives an $\alpha$ close to the desired 0.10.

If $H_1$ is true and $\lambda = 1.2$, $\sum_{i=1}^{4} X_i$ will have a Poisson distribution with a parameter equal to 4.8. According to Figure 6.4.10, the probability that the sum of a random sample of size 4 from such a distribution would equal or exceed 6 (i.e., $1 - \beta$ when $\lambda = 1.2$) is 0.348993.
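The "by inspection" step amounts to scanning upper-tail probabilities of the Poisson distribution with mean $4\lambda$. The following sketch (plain Python standing in for the Minitab printouts; the helper names are illustrative) tabulates $\alpha$ for several candidate cutoffs and then evaluates the power at $\lambda = 1.2$.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson random variable with the given mean."""
    return exp(-mean) * mean ** k / factorial(k)


def poisson_upper_tail(c, mean):
    """P(X >= c) for a Poisson random variable with the given mean."""
    return 1 - sum(poisson_pmf(k, mean) for k in range(c))


n = 4  # the sum of n Poisson(lambda) observations is Poisson(n * lambda)

# Type I error probability for each candidate rule "reject if the sum >= c".
alpha_by_cutoff = {c: poisson_upper_tail(c, n * 0.8) for c in range(4, 9)}

alpha = alpha_by_cutoff[6]               # the cutoff chosen in the example
power = poisson_upper_tail(6, n * 1.2)   # 1 - beta when lambda = 1.2

for c, a in alpha_by_cutoff.items():
    print(c, round(a, 6))
print(round(power, 6))
```

Because the statistic is discrete, no cutoff gives exactly $\alpha = 0.10$; the table of candidate cutoffs shows that 6 comes closest.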


Example 6.4.4

Suppose a random sample of seven observations is taken from the pdf $f_Y(y; \theta) = (\theta + 1)y^{\theta}$, $0 \leq y \leq 1$, to test

$$H_0: \theta = 2$$
versus
$$H_1: \theta > 2$$

As a decision rule, the experimenter plans to record $X$, the number of $y_i$'s that exceed 0.9, and reject $H_0$ if $X \geq 4$. What proportion of the time would such a decision rule lead to a Type I error?

To evaluate $\alpha = P(\text{Reject } H_0 \mid H_0 \text{ is true})$, we first need to recognize that $X$ is a binomial random variable where $n = 7$ and the parameter $p$ is an area under $f_Y(y; \theta = 2)$:

$$p = P(Y \geq 0.9 \mid H_0 \text{ is true}) = P[Y \geq 0.9 \mid f_Y(y; 2) = 3y^2] = \int_{0.9}^{1} 3y^2\, dy = 0.271$$

It follows, then, that $H_0$ will be incorrectly rejected 9.2% of the time:

$$\alpha = P(X \geq 4 \mid \theta = 2) = \sum_{k=4}^{7} \binom{7}{k} (0.271)^k (0.729)^{7-k} = 0.092$$
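The two-stage calculation in Example 6.4.4 (an area under the pdf, then a binomial tail) can be verified directly. A minimal sketch using only the standard library:

```python
from math import comb

# Under H0 (theta = 2) the pdf is 3y^2 on [0, 1], so P(Y >= 0.9) = 1 - 0.9^3.
p = 1 - 0.9 ** 3

# X counts how many of the 7 observations exceed 0.9; reject H0 when X >= 4.
alpha = sum(comb(7, k) * p ** k * (1 - p) ** (7 - k) for k in range(4, 8))

print(round(p, 3), round(alpha, 3))
```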


Comment. The basic notions of Type I and Type II errors first arose in a quality-control context. The pioneering work was done at the Bell Telephone Laboratories: There the terms producer's risk and consumer's risk were introduced for what we now call $\alpha$ and $\beta$. Eventually, these ideas were generalized by Neyman and Pearson in the 1930s and evolved into the theory of hypothesis testing as we know it today.


Questions

6.4.1. Recall the "Math for the Twenty-First Century" hypothesis test done in Example 6.2.1. Calculate the power of that test when the true mean is 500.

6.4.2. Carry out the details to verify the decision rule change cited on p. 371 in connection with Figure 6.4.6.

6.4.3. For the decision rule found in Question 6.2.2 to test $H_0$: $\mu = 95$ versus $H_1$: $\mu \neq 95$ at the $\alpha = 0.06$ level of significance, calculate $1 - \beta$ when $\mu = 90$.

6.4.4. Construct a power curve for the $\alpha = 0.05$ test of $H_0$: $\mu = 60$ versus $H_1$: $\mu \neq 60$ if the data consist of a random sample of size 16 from a normal distribution having $\sigma = 4$.
6.4.5. If $H_0$: $\mu = 240$ is tested against $H_1$: $\mu < 240$ at the $\alpha = 0.01$ level of significance with a random sample of twenty-five normally distributed observations, what proportion of the time will the procedure fail to recognize that $\mu$ has dropped to 220? Assume that $\sigma = 50$.

6.4.6. Suppose $n = 36$ observations are taken from a normal distribution where $\sigma = 8.0$ for the purpose of testing $H_0$: $\mu = 60$ versus $H_1$: $\mu \neq 60$ at the $\alpha = 0.07$ level of significance. The lead investigator skipped statistics class the day decision rules were being discussed and intends to reject $H_0$ if $\bar{y}$ falls in the region $(60 - \bar{y}^*, 60 + \bar{y}^*)$.

(a) Find $\bar{y}^*$.

(b) What is the power of the test when $\mu = 62$?

(c) What would the power of the test be when $\mu = 62$ if the critical region had been defined the correct way?

6.4.7. If $H_0$: $\mu = 200$ is to be tested against $H_1$: $\mu < 200$ at the $\alpha = 0.10$ level of significance based on a random sample of size $n$ from a normal distribution where $\sigma = 15.0$, what is the smallest value for $n$ that will make the power equal to at least 0.75 when $\mu = 197$?

6.4.8. Will $n = 45$ be a sufficiently large sample to test $H_0$: $\mu = 10$ versus $H_1$: $\mu \neq 10$ at the $\alpha = 0.05$ level of significance if the experimenter wants the Type II error probability to be no greater than 0.20 when $\mu = 12$? Assume that $\sigma = 4$.

6.4.9. If $H_0$: $\mu = 30$ is tested against $H_1$: $\mu > 30$ using $n = 16$ observations (normally distributed) and if $1 - \beta = 0.85$ when $\mu = 34$, what does $\alpha$ equal? Assume that $\sigma = 9$.


6.4.10. Suppose a sample of size 1 is taken from the pdf $f_Y(y) = (1/\lambda)e^{-y/\lambda}$, $y > 0$, for the purpose of testing

$$H_0: \lambda = 1$$
versus
$$H_1: \lambda > 1$$

The null hypothesis will be rejected if $y > 3.20$.


(a) Calculate the probability of committing a Type I 
error. 

(b) Calculate the probability of committing a Type II 
error when A= §. 

(c) Draw a diagram that shows the $\alpha$ and $\beta$ calculated in parts (a) and (b) as areas.


6.4.11. Polygraphs used in criminal investigations typically measure five bodily functions: (1) thoracic respiration, (2) abdominal respiration, (3) blood pressure and pulse rate, (4) muscular movement and pressure, and (5) galvanic skin response. In principle, the magnitude of these responses when the subject is asked a relevant question ("Did you murder your wife?") indicates whether he is lying or telling the truth. The procedure, of course, is not infallible, as a recent study bore out (82). Seven experienced polygraph examiners were given a set of forty records: twenty were from innocent suspects and twenty from guilty suspects. The subjects had been asked eleven questions, on the basis of which each examiner was to make an overall judgment: "Innocent" or "Guilty." The results are as follows:

                                Suspect's True Status
                                Innocent      Guilty
    Examiner's   "Innocent"       131           15
    Decision     "Guilty"           9          125

What would be the numerical values of $\alpha$ and $\beta$ in this context? In a judicial setting, should Type I and Type II errors carry equal weight? Explain.

6.4.12. An urn contains ten chips. An unknown number of the chips are white; the others are red. We wish to test

$H_0$: exactly half the chips are white
versus
$H_1$: more than half the chips are white

We will draw, without replacement, three chips and reject $H_0$ if two or more are white. Find $\alpha$. Also, find $\beta$ when the urn is (a) 60% white and (b) 70% white.

6.4.13. Suppose that a random sample of size 5 is drawn from a uniform pdf:

$$f_Y(y; \theta) = \begin{cases} \dfrac{1}{\theta}, & 0 < y < \theta \\ 0, & \text{elsewhere} \end{cases}$$

We wish to test

$$H_0: \theta = 2$$
versus
$$H_1: \theta > 2$$

by rejecting the null hypothesis if $y_{\max} \geq k$. Find the value of $k$ that makes the probability of committing a Type I error equal to 0.05.

6.4.14. A sample of size 1 is taken from the pdf

$$f_Y(y) = (\theta + 1)y^{\theta}, \quad 0 \leq y \leq 1$$

The hypothesis $H_0$: $\theta = 1$ is to be rejected in favor of $H_1$: $\theta > 1$ if $y \geq 0.90$. What is the test's level of significance?

6.4.15. A series of $n$ Bernoulli trials is to be observed as data for testing

$$H_0: p = \tfrac{1}{2}$$
versus
$$H_1: p > \tfrac{1}{2}$$

The null hypothesis will be rejected if $k$, the observed number of successes, equals $n$. For what value of $p$ will the probability of committing a Type II error equal 0.05?
6.4.16. Let $X_1$ be a binomial random variable with $n = 2$ and $p_{X_1} = P(\text{success})$. Let $X_2$ be an independent binomial random variable with $n = 4$ and $p_{X_2} = P(\text{success})$. Let $X = X_1 + X_2$. Calculate $\alpha$ if


. _ ail 
Ap: Px, = Px, = 3 
versus 

, 1 
Ay: px, = Px, > 3 


is to be tested by rejecting the null hypothesis when k > 5. 


6.4.17. A sample of size 1 from the pdf $f_Y(y) = (1 + \theta)y^{\theta}$, $0 \leq y \leq 1$, is to be the basis for testing

$$H_0: \theta = 1$$
versus
$$H_1: \theta < 1$$

The critical region will be the interval $y \leq \frac{1}{2}$. Find an expression for $1 - \beta$ as a function of $\theta$.

6.4.18. An experimenter takes a sample of size 1 from the Poisson probability model, $p_X(k) = e^{-\lambda}\lambda^k/k!$, $k = 0, 1, 2, \ldots$, and wishes to test

$$H_0: \lambda = 6$$
versus
$$H_1: \lambda < 6$$

by rejecting $H_0$ if $k \leq 2$.

(a) Calculate the probability of committing a Type I error.

(b) Calculate the probability of committing a Type II error when $\lambda = 4$.

6.4.19. A sample of size 1 is taken from the geometric 
probability model, px (k) = (1 — p)*'p, k= 1,2,3,..., to 
test Hy: p = + versus H;: p > +. The null hypothesis is to 
be rejected if k > 4. What is the probability that a Type IT 
error will be committed when p= +? 


6.4.20. Suppose that one observation from the exponential pdf, $f_Y(y) = \lambda e^{-\lambda y}$, $y > 0$, is to be used to test $H_0$: $\lambda = 1$ versus $H_1$: $\lambda < 1$. The decision rule calls for the null hypothesis to be rejected if $y \geq \ln 10$. Find $\beta$ as a function of $\lambda$.

6.4.21. A random sample of size 2 is drawn from a uniform pdf defined over the interval $[0, \theta]$. We wish to test

$$H_0: \theta = 2$$
versus
$$H_1: \theta < 2$$

by rejecting $H_0$ when $y_1 + y_2 \leq k$. Find the value for $k$ that gives a level of significance of 0.05.

6.4.22. Suppose that the hypotheses of Question 6.4.21 are to be tested with a decision rule of the form "Reject $H_0$: $\theta = 2$ if $y_1 y_2 \leq k^*$." Find the value of $k^*$ that gives a level of significance of 0.05 (see Theorem 3.8.5).


6.5 A Notion of Optimality: The Generalized Likelihood Ratio


In the next several chapters we will be studying some of the particular hypothe- 
sis tests that statisticians most often use in dealing with real-world problems. All 
of these have the same conceptual heritage—a fundamental notion known as the 
generalized likelihood ratio, or GLR. More than just a principle, the generalized 
likelihood ratio is a working criterion for actually suggesting test procedures. 

As a first look at this important idea, we will conclude Chapter 6 with an application of the generalized likelihood ratio to the problem of testing the parameter $\theta$ in a uniform pdf. Notice the relationship here between the likelihood ratio and the definition of an "optimal" hypothesis test.

Suppose $y_1, y_2, \ldots, y_n$ is a random sample from a uniform pdf over the interval $[0, \theta]$, where $\theta$ is unknown, and our objective is to test

$$H_0: \theta = \theta_0$$
versus
$$H_1: \theta < \theta_0$$

at a specified level of significance $\alpha$. What is the "best" decision rule for choosing between $H_0$ and $H_1$, and by what criterion is it considered optimal?
As a starting point in answering those questions, it will be necessary to define two parameter spaces, $\omega$ and $\Omega$. In general, $\omega$ is the set of unknown parameter values admissible under $H_0$. In the case of the uniform, the only parameter is $\theta$, and the null hypothesis restricts it to a single point:

$$\omega = \{\theta : \theta = \theta_0\}$$

The second parameter space, $\Omega$, is the set of all possible values of all unknown parameters. Here,

$$\Omega = \{\theta : 0 < \theta \leq \theta_0\}$$

Now, recall the definition of the likelihood function, $L$, from Definition 5.2.1. Given a sample of size $n$ from a uniform pdf,

$$L = L(\theta) = \prod_{i=1}^{n} f_Y(y_i; \theta) = \begin{cases} \left(\dfrac{1}{\theta}\right)^n, & 0 \leq y_i \leq \theta \\ 0, & \text{otherwise} \end{cases}$$

For reasons that will soon be clear, we need to maximize $L(\theta)$ twice, once under $\omega$ and again under $\Omega$. Since $\theta$ can take on only one value, $\theta_0$, under $\omega$,

$$\max_{\omega} L(\theta) = L(\theta_0) = \begin{cases} \left(\dfrac{1}{\theta_0}\right)^n, & 0 \leq y_i \leq \theta_0 \\ 0, & \text{otherwise} \end{cases}$$

Maximizing $L(\theta)$ under $\Omega$ (that is, with no restrictions) is accomplished by simply substituting the maximum likelihood estimate for $\theta$ into $L(\theta)$. For the uniform parameter, $y_{\max}$ is the maximum likelihood estimate (recall Question 5.2.10). Therefore,

$$\max_{\Omega} L(\theta) = \left(\frac{1}{y_{\max}}\right)^n$$

For notational simplicity, we denote $\max_{\omega} L(\theta)$ and $\max_{\Omega} L(\theta)$ by $L(\omega_e)$ and $L(\Omega_e)$, respectively.


Definition 6.5.1. Let $y_1, y_2, \ldots, y_n$ be a random sample from $f_Y(y; \theta_1, \ldots, \theta_k)$. The generalized likelihood ratio, $\lambda$, is defined to be

$$\lambda = \frac{\max_{\omega} L(\theta_1, \ldots, \theta_k)}{\max_{\Omega} L(\theta_1, \ldots, \theta_k)} = \frac{L(\omega_e)}{L(\Omega_e)}$$

For the uniform distribution,

$$\lambda = \frac{(1/\theta_0)^n}{(1/y_{\max})^n} = \left(\frac{y_{\max}}{\theta_0}\right)^n$$

Note that, in general, $\lambda$ will always be positive but never greater than 1 (why?). Furthermore, values of the likelihood ratio close to 1 suggest that the data are very compatible with $H_0$. That is, the observations are "explained" almost as well by the $H_0$ parameters as by any parameters [as measured by $L(\omega_e)$ and $L(\Omega_e)$]. For these values of $\lambda$ we should accept $H_0$. Conversely, if $L(\omega_e)/L(\Omega_e)$ were close to 0, the data would not be very compatible with the parameter values in $\omega$ and it would make sense to reject $H_0$.
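For the uniform model, the ratio is a one-line computation. The sketch below is only an illustration: the function name `uniform_glr` and the five data values are made up, and returning 0 when $y_{\max} > \theta_0$ is a convention adopted here for data that are impossible under $\Omega$.

```python
def uniform_glr(sample, theta0):
    """Generalized likelihood ratio (y_max / theta0)^n for uniform[0, theta] data."""
    y_max = max(sample)
    if y_max > theta0:
        # L(theta0) = 0 for such data; they cannot occur when theta <= theta0.
        return 0.0
    return (y_max / theta0) ** len(sample)


sample = [0.4, 1.1, 1.7, 0.9, 1.3]   # hypothetical observations
lam = uniform_glr(sample, theta0=2.0)
print(lam)
```

A value of $\lambda$ near 1 requires $y_{\max}$ to sit close to $\theta_0$, and the exponent $n$ makes the ratio drop off quickly as $y_{\max}$ moves away from $\theta_0$.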


Definition 6.5.2. A generalized likelihood ratio test (GLRT) is one that rejects $H_0$ whenever

$$0 < \lambda \leq \lambda^*$$

where $\lambda^*$ is chosen so that

$$P(0 < \Lambda \leq \lambda^* \mid H_0 \text{ is true}) = \alpha$$

(Note: In keeping with the capital letter notation introduced in Chapter 3, $\Lambda$ denotes the generalized likelihood ratio expressed as a random variable.)

Let $f_\Lambda(\lambda \mid H_0)$ denote the pdf of the generalized likelihood ratio when $H_0$ is true. If $f_\Lambda(\lambda \mid H_0)$ were known, $\lambda^*$ (and, therefore, the decision rule) could be determined by solving the equation

$$\alpha = \int_0^{\lambda^*} f_\Lambda(\lambda \mid H_0)\, d\lambda$$

(see Figure 6.5.1). In many situations, though, $f_\Lambda(\lambda \mid H_0)$ is not known, and it becomes necessary to show that $\Lambda$ is a monotonic function of some quantity $W$, where the distribution of $W$ is known. Once we have found such a statistic, any test based on $w$ will be equivalent to one based on $\lambda$.

[Figure 6.5.1: the pdf $f_\Lambda(\lambda \mid H_0)$, with the area $\alpha$ over the "Reject $H_0$" region from 0 to $\lambda^*$.]

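Here, a suitable $W$ is easy to find. Note that

$$P(\Lambda \leq \lambda^* \mid H_0 \text{ is true}) = \alpha = P\left[\left(\frac{Y_{\max}}{\theta_0}\right)^n \leq \lambda^* \,\Big|\, H_0 \text{ is true}\right] = P\left(\frac{Y_{\max}}{\theta_0} \leq \sqrt[n]{\lambda^*} \,\Big|\, H_0 \text{ is true}\right)$$

Let $W = Y_{\max}/\theta_0$ and $w^* = \sqrt[n]{\lambda^*}$. Then

$$P(\Lambda \leq \lambda^* \mid H_0 \text{ is true}) = P(W \leq w^* \mid H_0 \text{ is true}) \qquad (6.5.1)$$

Here the right-hand side of Equation 6.5.1 can be evaluated from what we already know about the density function for the largest order statistic from a uniform distribution. Let $f_{Y_{\max}}(y; \theta_0)$ be the density function for $Y_{\max}$. Then

$$f_W(w; \theta_0) = \theta_0 f_{Y_{\max}}(\theta_0 w; \theta_0) \quad \text{(recall Theorem 3.8.2)}$$

which, from Theorem 3.10.1, reduces to

$$\theta_0 \cdot \frac{n}{\theta_0}\left(\frac{\theta_0 w}{\theta_0}\right)^{n-1} = n w^{n-1}, \quad 0 \leq w \leq 1$$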
90 
Therefore,

$$P(W \leq w^* \mid H_0 \text{ is true}) = \int_0^{w^*} n w^{n-1}\, dw = (w^*)^n = \alpha$$

implying that the critical value for $W$ is

$$w^* = \sqrt[n]{\alpha}$$

That is, the GLRT calls for $H_0$ to be rejected if

$$w = \frac{y_{\max}}{\theta_0} \leq \sqrt[n]{\alpha}$$

Questions 


6.5.1. Let $k_1, k_2, \ldots, k_n$ be a random sample from the geometric probability function

$$p_X(k; p) = (1 - p)^{k-1} p, \quad k = 1, 2, \ldots$$

Find $\lambda$, the generalized likelihood ratio for testing $H_0$: $p = p_0$ versus $H_1$: $p \neq p_0$.

6.5.2. Let $y_1, y_2, \ldots, y_{10}$ be a random sample from an exponential pdf with unknown parameter $\lambda$. Find the form of the GLRT for $H_0$: $\lambda = \lambda_0$ versus $H_1$: $\lambda \neq \lambda_0$. What integral would have to be evaluated to determine the critical value if $\alpha$ were equal to 0.05?

6.5.3. Let $y_1, y_2, \ldots, y_n$ be a random sample from a normal pdf with unknown mean $\mu$ and variance 1. Find the form of the GLRT for $H_0$: $\mu = \mu_0$ versus $H_1$: $\mu \neq \mu_0$.


6.5.4. In the scenario of Question 6.5.3, suppose the alternative hypothesis is $H_1$: $\mu = \mu_1$, for some particular value of $\mu_1$. How does the likelihood ratio test change in this case? In what way does the critical region depend on the particular value of $\mu_1$?


6.5.5. Let $k$ denote the number of successes observed in a sequence of $n$ independent Bernoulli trials, where $p = P(\text{success})$.

(a) Show that the critical region of the likelihood ratio test of $H_0$: $p = \frac{1}{2}$ versus $H_1$: $p \neq \frac{1}{2}$ can be written in the form

$$k \cdot \ln(k) + (n - k) \cdot \ln(n - k) \geq \lambda^{**}$$

(b) Use the symmetry of the graph of

$$f(k) = k \cdot \ln(k) + (n - k) \cdot \ln(n - k)$$


to show that the critical region can be written in the 
form 


where c is a constant determined by a. 


6.5.6. Suppose a sufficient statistic exists for the parameter $\theta$. Use Theorem 5.6.1 to show that the critical region of a likelihood ratio test will depend on the sufficient statistic.


6.6 Taking a Second Look at Statistics (Statistical 
Significance versus “Practical” Significance) 


The most important concept in this chapter, the notion of statistical significance, is also the most problematic. Why? Because statistical significance does not always mean what it seems to mean. By definition, the difference between, say, $\bar{y}$ and $\mu_0$ is statistically significant if $H_0$: $\mu = \mu_0$ can be rejected at the $\alpha = 0.05$ level. What that implies is that a sample mean equal to the observed $\bar{y}$ is not likely to have come from a (normal) distribution whose true mean was $\mu_0$. What it does not imply is that the true mean is necessarily much different than $\mu_0$.

Recall the discussion of power curves in Section 6.4 and, in particular, the effect of $n$ on $1 - \beta$. The example illustrating those topics involved an additive that might be able to increase a car's gas mileage. The hypotheses being tested were

$$H_0: \mu = 25.0$$
versus
$$H_1: \mu > 25.0$$

where $\sigma$ was assumed to be 2.4 (mpg) and $\alpha$ was set at 0.05. If $n = 30$, the decision rule called for $H_0$ to be rejected when $\bar{y} \geq 25.718$ (see p. 354). Figure 6.6.1 is the test's power curve [the point $(\mu, 1 - \beta) = (25.75, 1 - 0.47)$ was calculated on p. 368].

[Figure 6.6.1: power curve for the test when $n = 30$.]

The important point was made in Section 6.4 that researchers have a variety of ways to increase the power of a test, that is, to decrease the probability of committing a Type II error. Experimentally, the usual way is to increase the sample size, which has the effect of reducing the overlap between the $H_0$ and $H_1$ distributions (Figure 6.4.7 pictured such a reduction when the sample size was kept fixed but $\sigma$ was decreased from 2.4 to 1.2). Here, to show the effect of $n$ on $1 - \beta$, Figure 6.6.2 superimposes the power curves for testing $H_0$: $\mu = 25.0$ versus $H_1$: $\mu > 25.0$ in the cases where $n = 30$, $n = 60$, and $n = 900$ (keeping $\alpha = 0.05$ and $\sigma = 2.4$).

[Figure 6.6.2: superimposed power curves for $n = 30$, $n = 60$, and $n = 900$.]
There is good news in Figure 6.6.2 and there is bad news in Figure 6.6.2. The good news, not surprisingly, is that the probability of rejecting a false hypothesis increases dramatically as $n$ increases. If the true mean $\mu$ is 25.25, for example, the $Z$ test will (correctly) reject $H_0$: $\mu = 25.0$ 14% of the time when $n = 30$, 20% of the time when $n = 60$, and a robust 93% of the time when $n = 900$.

The bad news implicit in Figure 6.6.2 is that any false hypothesis, even one where the true $\mu$ is just "epsilon" away from $\mu_0$, can be rejected virtually 100% of the time if a large enough sample size is used. Why is that bad? Because saying that a difference (between $\bar{y}$ and $\mu_0$) is statistically significant makes it sound meaningful when, in fact, it may be totally inconsequential.
Suppose, for example, an additive could be found that would increase a car's gas mileage from 25.000 mpg to 25.001 mpg. Such a minuscule improvement would mean basically nothing to the consumer, yet if a large enough sample size were used, the probability of rejecting $H_0$: $\mu = 25.000$ in favor of $H_1$: $\mu > 25.000$ could be made arbitrarily close to 1. That is, the difference between $\bar{y}$ and 25.000 would qualify as being statistically significant even though it had no "practical significance" whatsoever.
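The percentages quoted for $\mu = 25.25$, and the claim about a 0.001 mpg improvement, both follow from the power function of the one-sided $Z$ test. The sketch below assumes the rounded cutoff 1.64; the sample size of $10^8$ is a hypothetical value chosen only to show how large $n$ must be before such a tiny difference is detected almost surely.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power(mu1, n, mu0=25.0, sigma=2.4, z_alpha=1.64):
    """P(reject H0: mu = mu0 in favor of mu > mu0) when the true mean is mu1."""
    shift = (mu1 - mu0) / (sigma / sqrt(n))   # true mean shift in standard-error units
    return 1 - Z.cdf(z_alpha - shift)


# Effect of n when the true mean is 25.25.
for n in (30, 60, 900):
    print(n, round(power(25.25, n), 2))

# A practically meaningless difference, detected with a (hypothetical) huge sample.
print(round(power(25.001, 10 ** 8), 3))
```

The same function with a fixed $n$ and a range of values for `mu1` traces out a power curve of the kind pictured in Figures 6.6.1 and 6.6.2.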

Two lessons should be learned here, one old and one new. The new lesson is to be wary of inferences drawn from experiments or surveys based on huge sample sizes. Many statistically significant conclusions are likely to result in those situations, but some of those "reject $H_0$'s" may be driven primarily by the sample size. Paying attention to the magnitude of $\bar{y} - \mu_0$ (or $\frac{k}{n} - p_0$) is often a good way to keep the conclusion of a hypothesis test in perspective.

The second lesson has been encountered before and will come up again: Analyzing data is not a simple exercise in plugging into formulas or reading computer printouts. Real-world data are seldom simple, and they cannot be adequately summarized, quantified, or interpreted with any single statistical technique. Hypothesis tests, like every other inference procedure, have strengths and weaknesses, assumptions and limitations. Being aware of what they can tell us, and how they can trick us, is the first step toward using them properly.
I know of scarcely anything so apt to impress the imagination as the wonderful form
of cosmic order expressed by the “law of frequency of error” (the normal 
distribution). The law would have been personified by the Greeks and deified, if they 
had known of it. It reigns with serenity and in complete self effacement amidst the 
wildest confusion. The huger the mob, and the greater the anarchy, the more perfect 
is its sway. It is the supreme law of Unreason. 

—Francis Galton 


7.1. Introduction 


Finding probability distributions to describe — and, ultimately, to predict — empirical 
data is one of the most important contributions a statistician can make to the 
research scientist. Already we have seen a number of functions playing that role. 
The binomial is an obvious model for the number of correct responses in the Pratt-Woodruff ESP experiment (Case Study 4.3.1); the probability of holding a winning ticket in the Florida Lottery is given by the hypergeometric (Example 3.2.6); and applications of the Poisson have run the gamut from radioactive decay (Case Study 4.2.2) to the number of wars starting in a given year (Case Study 4.2.3). Those examples notwithstanding, by far the most widely used probability model in statistics is the normal (or Gaussian) distribution,

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left[(y - \mu)/\sigma\right]^2}, \quad -\infty < y < \infty \qquad (7.1.1)$$

Some of the history surrounding the normal curve has already been discussed in Chapter 4: how it first appeared as a limiting form of the binomial, but then soon found itself used most often in nonbinomial situations. We also learned how to find areas under normal curves and did some problems involving sums and averages. Chapter 5 provided estimates of the parameters of the normal density and showed their role in fitting normal curves to data. In this chapter, we will take a second look at the properties and applications of this singularly important pdf, this time paying attention to the part it plays in estimation and hypothesis testing.


7.2 Comparing $\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$ and $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$

Suppose that a random sample of $n$ measurements, $Y_1, Y_2, \ldots, Y_n$, is to be taken on a trait that is thought to be normally distributed, the objective being to draw an inference about the underlying pdf's true mean, $\mu$. If the variance $\sigma^2$ is known, we already know how to proceed: A decision rule for testing $H_0$: $\mu = \mu_0$ is given in Theorem 6.2.1, and the construction of a confidence interval for $\mu$ is described in Section 5.3. As we learned, both of those procedures are based on the fact that the ratio $Z = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$ has a standard normal distribution, $f_Z(z)$.

In practice, though, the parameter $\sigma^2$ is seldom known, so the ratio $\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$ cannot be calculated, even if a value for the mean (say, $\mu_0$) is substituted for $\mu$. Typically, the only information experimenters have about $\sigma^2$ is what can be gleaned from the $Y_i$'s themselves. The usual estimator for the population variance, of course, is $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$, the unbiased version of the maximum likelihood estimator for $\sigma^2$. The question is, what effect does replacing $\sigma$ with $S$ have on the $Z$ ratio? Are there probabilistic differences between $\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$ and $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$?

Historically, many early practitioners of statistics felt that replacing $\sigma$ with $S$ had, in fact, no effect on the distribution of the $Z$ ratio. Sometimes they were right. If the sample size is very large (which was not an unusual state of affairs in many of the early applications of statistics), the estimator $S$ is essentially a constant and for all intents and purposes equal to the true $\sigma$. Under those conditions, the ratio $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$ will behave much like a standard normal random variable, $Z$. When the sample size $n$ is small, though, replacing $\sigma$ with $S$ does matter, and it changes the way we draw inferences about $\mu$.

Credit for recognizing that $\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$ and $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$ do not have the same distribution goes to William Sealy Gossett. After graduating in 1899 from Oxford with First Class degrees in Chemistry and Mathematics, Gossett took a position at Arthur Guinness, Son & Co., Ltd., a firm that brewed a thick, dark ale known as stout. Given the task of making the art of brewing more scientific, Gossett quickly realized that any experimental studies would necessarily face two obstacles. First, for a variety of economic and logistical reasons, sample sizes would invariably be small; and second, there would never be any way to know the exact value of the true variance, $\sigma^2$, associated with any set of measurements.

So, when the objective of a study was to draw an inference about $\mu$, Gossett found himself working with the ratio $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$, where $n$ was often on the order of four or five. The more he encountered that situation, the more he became convinced that ratios of that sort are not adequately described by the standard normal pdf. In particular, the distribution of $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$ seemed to have the same general bell-shaped configuration as $f_Z(z)$, but the tails were "thicker"; that is, ratios much smaller than zero or much greater than zero were not as rare as the standard normal pdf would predict.

Figure 7.2.1 illustrates the distinction between the distributions of $\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$ and $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$ that caught Gossett's attention. In Figure 7.2.1a, five hundred samples of size $n = 4$ have been drawn from a normal distribution where the value of $\sigma$ is known. For each sample, the ratio $\frac{\bar{Y} - \mu}{\sigma/\sqrt{4}}$ has been computed. Superimposed over the shaded histogram of those five hundred ratios is the standard normal curve, $f_Z(z)$. Clearly, the probabilistic behavior of the random variable $\frac{\bar{Y} - \mu}{\sigma/\sqrt{4}}$ is entirely consistent with $f_Z(z)$.

[Figure 7.2.1(a): histogram of the observed distribution of $\frac{\bar{Y} - \mu}{\sigma/\sqrt{4}}$ (500 samples), with $f_Z(z)$ superimposed; vertical axis: density.]

The histogram pictured in Figure 7.2.1b is also based on five hundred samples of size $n = 4$ drawn from a normal distribution. Here, though, $S$ has been calculated for each sample, so the ratios comprising the histogram are $\frac{\bar{Y} - \mu}{S/\sqrt{4}}$ rather than $\frac{\bar{Y} - \mu}{\sigma/\sqrt{4}}$. In this case, the superimposed standard normal pdf does not adequately describe the histogram; specifically, it underestimates the number of ratios much less than zero as well as the number much larger than zero (which is exactly what Gossett had noted).
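The experiment behind Figure 7.2.1 is easy to imitate. The following sketch draws repeated samples of size 4 from a normal distribution and records how often each ratio lands more than 2 units from zero; the cutoff of 2, the seed, the choice $\mu = 0$, $\sigma = 1$, and the function name are all assumptions made for the illustration, not details taken from the figure.

```python
import random
from math import sqrt
from statistics import mean, stdev


def tail_fractions(n=4, reps=500, mu=0.0, sigma=1.0, cutoff=2.0, seed=7):
    """Fractions of the sigma-based and S-based ratios whose absolute value exceeds cutoff."""
    rng = random.Random(seed)
    z_tail = t_tail = 0
    for _ in range(reps):
        ys = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = mean(ys)
        s = stdev(ys)                                  # sample standard deviation (divisor n - 1)
        z_ratio = (ybar - mu) / (sigma / sqrt(n))      # uses the known sigma
        t_ratio = (ybar - mu) / (s / sqrt(n))          # sigma replaced by S
        z_tail += abs(z_ratio) > cutoff
        t_tail += abs(t_ratio) > cutoff
    return z_tail / reps, t_tail / reps


z_frac, t_frac = tail_fractions(reps=5000)
print(z_frac, t_frac)
```

The second fraction should be noticeably larger than the first, which is the "thicker tails" behavior described above.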

Gossett published a paper in 1908 entitled "The Probable Error of a Mean," in which he derived a formula for the pdf of the ratio $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$. To prevent disclosure of confidential company information, Guinness prohibited its employees from publishing any papers, regardless of content. So, Gossett's work, one of the major statistical breakthroughs of the twentieth century, was published under the name "Student."

Initially, Gossett's discovery attracted very little attention. Virtually none of his contemporaries had the slightest inkling of the impact that Gossett's paper would have on modern statistics. Indeed, fourteen years after its publication, Gossett sent R. A. Fisher a tabulation of his distribution, with a note saying, "I am sending you a copy of Student's Tables as you are the only man that's ever likely to use them."

Fisher very much understood the value of Gossett's work and believed that Gossett had effected a "logical revolution." Fisher presented a rigorous mathematical derivation of Gossett's pdf in 1924, the core of which appears in Appendix 7.A.2. Fisher somewhat arbitrarily chose the letter $t$ for the $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$ statistic. Consequently, its pdf is known as the Student t distribution.


7.3 Deriving the Distribution of $\frac{\bar{Y}-\mu}{S/\sqrt{n}}$


Broadly speaking, the set of probability functions that statisticians have occasion 
to use fall into two categories. There are a dozen or so that can effectively model 
the individual measurements taken on a variety of real-world phenomena. These 
are the distributions we studied in Chapters 3 and 4—most notably, the normal, 
binomial, Poisson, exponential, hypergeometric, and uniform. There is a smaller set 
of probability distributions that model the behavior of functions based on sets of 
n random variables. These are called sampling distributions, and they are typically 
used for inference purposes. 

The normal distribution belongs to both categories. We have seen a number of
scenarios (IQ scores, for example) where the Gaussian distribution is very effective
at describing the distribution of repeated measurements. At the same time, the normal
distribution is used to model the probabilistic behavior of $Z = \frac{\bar{Y}-\mu}{\sigma/\sqrt{n}}$. In the latter
capacity, it serves as a sampling distribution.

Next to the normal distribution, the three most important sampling distributions
are the Student t distribution, the chi square distribution, and the F distribution. All
three will be introduced in this section, partly because we need the latter two to
derive $f_T(t)$, the pdf for the t ratio, $T = \frac{\bar{Y}-\mu}{S/\sqrt{n}}$. So, although our primary objective in
this section is to study the Student t distribution, we will in the process introduce the
two other sampling distributions that we will be encountering over and over again
in the chapters ahead.

Deriving the pdf for a t ratio is not a simple matter. That may come as a surprise,
given that deducing the pdf for $\frac{\bar{Y}-\mu}{\sigma/\sqrt{n}}$ is quite easy (using moment-generating
functions). But going from $\frac{\bar{Y}-\mu}{\sigma/\sqrt{n}}$ to $\frac{\bar{Y}-\mu}{S/\sqrt{n}}$ creates some major mathematical
complications because T (unlike Z) is the ratio of two random variables, $\bar{Y}$ and
S, both of which are functions of n random variables, $Y_1, Y_2, \ldots, Y_n$. In general
(and this ratio is no exception), finding pdfs of quotients of random variables is
difficult, especially when the numerator and denominator random variables have
cumbersome pdfs to begin with.

As we will see in the next few pages, the derivation of $f_T(t)$ plays out in several
steps. First, we show that $\sum_{j=1}^{m} Z_j^2$, where the $Z_j$’s are independent standard normal
random variables, has a gamma distribution (more specifically, a special case of the
gamma distribution, called a chi square distribution). Then we show that $\bar{Y}$ and $S^2$,
based on a random sample of size n from a normal distribution, are independent
random variables and that $\frac{(n-1)S^2}{\sigma^2}$ has a chi square distribution. Next we derive the
pdf of the ratio of two independent chi square random variables (which is called
the F distribution). The final step in the proof is to show that $T^2 = \left(\frac{\bar{Y}-\mu}{S/\sqrt{n}}\right)^2$ can be
written as the quotient of two independent chi square random variables, making it
a special case of the F distribution. Knowing the latter allows us to deduce $f_T(t)$.

Theorem 7.3.1. Let $U = \sum_{j=1}^{m} Z_j^2$, where $Z_1, Z_2, \ldots, Z_m$ are independent standard normal random
variables. Then U has a gamma distribution with $r = \frac{m}{2}$ and $\lambda = \frac{1}{2}$. That is,

\[ f_U(u) = \frac{1}{2^{m/2}\,\Gamma\left(\frac{m}{2}\right)}\, u^{(m/2)-1} e^{-u/2}, \quad u \geq 0 \]


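Proof First take m = 1. For any $u \geq 0$,

\[ F_{Z^2}(u) = P(Z^2 \leq u) = P(-\sqrt{u} \leq Z \leq \sqrt{u}) = 2P(0 \leq Z \leq \sqrt{u}) = \frac{2}{\sqrt{2\pi}} \int_0^{\sqrt{u}} e^{-z^2/2}\, dz \]

Differentiating both sides of the equation for $F_{Z^2}(u)$ gives $f_{Z^2}(u)$:

\[ f_{Z^2}(u) = \frac{d}{du} F_{Z^2}(u) = \frac{2}{\sqrt{2\pi}} \cdot \frac{1}{2\sqrt{u}}\, e^{-u/2} = \frac{1}{2^{1/2}\,\Gamma\left(\frac{1}{2}\right)}\, u^{(1/2)-1} e^{-u/2} \]

Notice that $f_U(u) = f_{Z^2}(u)$ has the form of a gamma pdf with $r = \frac{1}{2}$ and $\lambda = \frac{1}{2}$. By
Theorem 4.6.4, then, the sum of m such squares has the stated gamma distribution
with $r = m\left(\frac{1}{2}\right) = \frac{m}{2}$ and $\lambda = \frac{1}{2}$.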
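A simulation can make Theorem 7.3.1 more concrete. The sketch below is only an illustration, not part of the proof: it assumes m = 5, a fixed seed, 20,000 replications, and a midpoint-rule integral, all of which are arbitrary choices, and the function names are hypothetical. It compares sums of squared standard normal draws with the stated pdf, and with the mean m and variance 2m quoted in Question 7.3.2.

```python
import math
import random

def chi_square_pdf(u, m):
    """Pdf from Theorem 7.3.1: gamma pdf with r = m/2 and lambda = 1/2."""
    if u <= 0:
        return 0.0
    log_const = -(m / 2) * math.log(2) - math.lgamma(m / 2)
    return math.exp(log_const + (m / 2 - 1) * math.log(u) - u / 2)

def chi_square_cdf(x, m, steps=20000):
    """Midpoint-rule approximation to P(U <= x)."""
    h = x / steps
    return h * sum(chi_square_pdf((i + 0.5) * h, m) for i in range(steps))

rng = random.Random(12345)   # fixed seed so the run is repeatable
m, reps = 5, 20000           # illustrative choices, not from the text

# Each draw is a sum of m squared independent standard normal values.
draws = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m)) for _ in range(reps)]

sample_mean = sum(draws) / reps
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (reps - 1)

x = 9.0
empirical = sum(d <= x for d in draws) / reps   # simulated P(U <= 9)
theoretical = chi_square_cdf(x, m)              # integral of the stated pdf

print(f"mean {sample_mean:.3f} (theory {m}), variance {sample_var:.3f} (theory {2 * m})")
print(f"P(U <= {x}): simulated {empirical:.4f}, from pdf {theoretical:.4f}")
```

Agreement between the simulated and integrated probabilities is consistent with the theorem but does not, of course, prove it.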


The distribution of the sum of squares of independent standard normal random 
variables is sufficiently important that it gets its own name, despite the fact that it 
represents nothing more than a special case of the gamma distribution. 


Definition 7.3.1. The pdf of $U = \sum_{j=1}^{m} Z_j^2$, where $Z_1, Z_2, \ldots, Z_m$ are independent
standard normal random variables, is called the chi square distribution with m
degrees of freedom.

The next theorem is especially critical in the derivation of $f_{T_n}(t)$. Using simple
algebra, it can be shown that the square of a t ratio can be written as the quotient of
two chi square random variables, one a function of $\bar{Y}$ and the other a function of $S^2$.
By showing that $\bar{Y}$ and $S^2$ are independent (as Theorem 7.3.2 does), Theorem 3.8.4
can be used to find an expression for the pdf of the quotient.


Theorem 7.3.2. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a normal distribution with mean $\mu$ and
variance $\sigma^2$. Then

a. $S^2$ and $\bar{Y}$ are independent.

b. $\frac{(n-1)S^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ has a chi square distribution with $n - 1$ degrees of
freedom.

Proof See Appendix 7.A.2
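Both parts of Theorem 7.3.2 can be checked informally by simulation. The sketch below assumes n = 6, $\mu$ = 50, $\sigma$ = 10, a fixed seed, and 20,000 replications (all arbitrary illustrative choices), with hypothetical helper names. A sample correlation near zero is only a necessary symptom of independence, so the first check is suggestive rather than conclusive; the second compares the mean and variance of $(n-1)S^2/\sigma^2$ with the chi square values $n - 1$ and $2(n - 1)$.

```python
import math
import random

def sample_mean_and_variance(sample):
    """Return (ybar, s^2) using the n - 1 divisor for the sample variance."""
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)
    return ybar, s2

def correlation(xs, ys):
    """Ordinary sample correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2024)                    # fixed seed for repeatability
n, mu, sigma, reps = 6, 50.0, 10.0, 20000    # illustrative values only

means, scaled = [], []
for _ in range(reps):
    ybar, s2 = sample_mean_and_variance([rng.gauss(mu, sigma) for _ in range(n)])
    means.append(ybar)
    scaled.append((n - 1) * s2 / sigma ** 2)   # should behave like chi square, n - 1 df

scaled_mean = sum(scaled) / reps
scaled_var = sum((v - scaled_mean) ** 2 for v in scaled) / (reps - 1)
corr = correlation(means, scaled)

print(f"corr(Ybar, S^2) = {corr:.4f}")
print(f"(n-1)S^2/sigma^2: mean {scaled_mean:.3f} (theory {n - 1}), "
      f"variance {scaled_var:.3f} (theory {2 * (n - 1)})")
```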


As we will see shortly, the square of a t ratio is a special case of an F random
variable. The next definition and theorem summarize the properties of the
F distribution that we will need to find the pdf associated with the Student t
distribution.


Definition 7.3.2. Suppose that U and V are independent chi square random
variables with n and m degrees of freedom, respectively. A random variable of
the form $\frac{V/m}{U/n}$ is said to have an F distribution with m and n degrees of freedom.


Comment The F in the name of this distribution commemorates the renowned 
statistician Sir Ronald Fisher. 


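Theorem 7.3.3. Suppose $F_{m,n} = \frac{V/m}{U/n}$ denotes an F random variable with m and n degrees of freedom.
The pdf of $F_{m,n}$ has the form

\[ f_{F_{m,n}}(w) = \frac{\Gamma\left(\frac{m+n}{2}\right) m^{m/2}\, n^{n/2}\, w^{(m/2)-1}}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right) (n + mw)^{(m+n)/2}}, \quad w \geq 0 \]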


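Proof We begin by finding the pdf for V/U. From Theorem 7.3.1 we know that

\[ f_V(v) = \frac{1}{2^{m/2}\,\Gamma(m/2)}\, v^{(m/2)-1} e^{-v/2} \quad \text{and} \quad f_U(u) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, u^{(n/2)-1} e^{-u/2} \]

From Theorem 3.8.4, we have that the pdf of W = V/U is

\[ f_{V/U}(w) = \int_0^\infty |u|\, f_U(u)\, f_V(uw)\, du \]

\[ = \int_0^\infty u \cdot \frac{1}{2^{n/2}\,\Gamma(n/2)}\, u^{(n/2)-1} e^{-u/2} \cdot \frac{1}{2^{m/2}\,\Gamma(m/2)}\, (uw)^{(m/2)-1} e^{-uw/2}\, du \]

\[ = \frac{1}{2^{(n+m)/2}\,\Gamma(n/2)\,\Gamma(m/2)}\, w^{(m/2)-1} \int_0^\infty u^{\frac{n+m}{2}-1} e^{-[(1+w)/2]u}\, du \]

The integrand is the variable part of a gamma density with r = (n + m)/2 and $\lambda$ =
(1 + w)/2. Thus, the integral equals the inverse of the density’s constant. This gives

\[ f_{V/U}(w) = \frac{1}{2^{(n+m)/2}\,\Gamma(n/2)\,\Gamma(m/2)}\, w^{(m/2)-1}\, \frac{\Gamma\left(\frac{n+m}{2}\right)}{[(1+w)/2]^{\frac{n+m}{2}}} = \frac{\Gamma\left(\frac{n+m}{2}\right)}{\Gamma(n/2)\,\Gamma(m/2)} \cdot \frac{w^{(m/2)-1}}{(1+w)^{\frac{n+m}{2}}} \]

The statement of the theorem, then, follows from Theorem 3.8.2:

\[ f_{\frac{V/m}{U/n}}(w) = f_{\frac{n}{m} \cdot \frac{V}{U}}(w) = \frac{1}{n/m}\, f_{V/U}\left(\frac{w}{n/m}\right) = \frac{m}{n}\, f_{V/U}\left(\frac{m}{n}\, w\right) \]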

F Tables 


When graphed, an F distribution looks very much like a typical chi square
distribution: values of $\frac{V/m}{U/n}$ can never be negative and the F pdf is skewed sharply
to the right. Clearly, the complexity of $f_{F_{m,n}}(r)$ makes the function difficult to work
with directly. Tables, though, are widely available that give various percentiles of F
distributions for different values of m and n.

Figure 7.3.1 shows $f_{F_{3,5}}(r)$. In general, the symbol $F_{p,m,n}$ will be used to
denote the 100pth percentile of the F distribution with m and n degrees of freedom.
Here, the 95th percentile of $f_{F_{3,5}}(r)$, that is, $F_{.95,3,5}$, is 5.41 (see Appendix
Table A.4).

Using the F Distribution to Derive the pdf for t Ratios


Now we have all the background results necessary to find the pdf of $\frac{\bar{Y}-\mu}{S/\sqrt{n}}$. Actually,
though, we can do better than that because what we have been calling the “t ratio”
is just one special case of an entire family of quotients known as t ratios. Finding the
pdf for that entire family will give us the probability distribution for $\frac{\bar{Y}-\mu}{S/\sqrt{n}}$ as well.


Definition 7.3.3. Let Z be a standard normal random variable and let U be a
chi square random variable independent of Z with n degrees of freedom. The
Student t ratio with n degrees of freedom is denoted $T_n$, where

\[ T_n = \frac{Z}{\sqrt{U/n}} \]

Comment The term “degrees of freedom” is often abbreviated by df.

Lemma The pdf for $T_n$ is symmetric: $f_{T_n}(t) = f_{T_n}(-t)$, for all t.


Proof For convenience of notation, let $V = \sqrt{U/n}$. Then by Theorem 3.8.4 and the
symmetry of the pdf of Z,

\[ f_{T_n}(t) = \int_0^\infty v\, f_V(v)\, f_Z(tv)\, dv = \int_0^\infty v\, f_V(v)\, f_Z(-tv)\, dv = f_{T_n}(-t) \]


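Theorem 7.3.4. The pdf for a Student t random variable with n degrees of freedom is given by

\[ f_{T_n}(t) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\; \Gamma\left(\frac{n}{2}\right) \left(1 + \frac{t^2}{n}\right)^{(n+1)/2}}, \quad -\infty < t < \infty \]

Proof Note that $T_n^2 = \frac{Z^2}{U/n}$ has an F distribution with 1 and n df. Therefore,

\[ f_{T_n^2}(t) = \frac{n^{n/2}\, \Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{1}{2}\right) \Gamma\left(\frac{n}{2}\right)}\, t^{-1/2}\, \frac{1}{(n + t)^{(n+1)/2}}, \quad t > 0 \]

Suppose that t > 0. By the symmetry of $f_{T_n}(t)$,

\[ F_{T_n}(t) = P(T_n \leq t) = \frac{1}{2} + P(0 \leq T_n \leq t) \]

\[ = \frac{1}{2} + \frac{1}{2} P(-t \leq T_n \leq t) = \frac{1}{2} + \frac{1}{2} P(0 \leq T_n^2 \leq t^2) = \frac{1}{2} + \frac{1}{2} F_{T_n^2}(t^2) \]

Differentiating $F_{T_n}(t)$ gives the stated result:

\[ f_{T_n}(t) = F'_{T_n}(t) = t \cdot f_{T_n^2}(t^2) \]

\[ = t\, \frac{n^{n/2}\, \Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{1}{2}\right) \Gamma\left(\frac{n}{2}\right)}\, (t^2)^{-1/2}\, \frac{1}{(n + t^2)^{(n+1)/2}} \]

\[ = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\; \Gamma\left(\frac{n}{2}\right)} \cdot \frac{1}{\left[1 + \left(\frac{t^2}{n}\right)\right]^{(n+1)/2}} \]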

Comment Over the years, the lowercase t has come to be the accepted symbol 
for the random variable of Definition 7.3.3. We will follow that convention when 
the context allows some flexibility. In mathematical statements about distributions, 
though, we will be consistent with random variable notation and denote the Student
t ratio as $T_n$.


All that remains to be verified, then, to accomplish our original goal of finding
the pdf for $\frac{\bar{Y}-\mu}{S/\sqrt{n}}$ is to show that the latter is a special case of the Student t random
variable described in Definition 7.3.3. Theorem 7.3.5 provides the details. Notice
that a sample of size n yields a t ratio in this case having $n - 1$ degrees of freedom.


Theorem 7.3.5. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a normal distribution with mean $\mu$ and
standard deviation $\sigma$. Then

\[ T_{n-1} = \frac{\bar{Y}-\mu}{S/\sqrt{n}} \]

has a Student t distribution with $n - 1$ degrees of freedom.


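Proof We can rewrite $\frac{\bar{Y}-\mu}{S/\sqrt{n}}$ in the form

\[ \frac{\bar{Y}-\mu}{S/\sqrt{n}} = \frac{\dfrac{\bar{Y}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{(n-1)S^2}{\sigma^2 (n-1)}}} \]

But $\frac{\bar{Y}-\mu}{\sigma/\sqrt{n}}$ is a standard normal random variable and $\frac{(n-1)S^2}{\sigma^2}$ has a chi square
distribution with $n - 1$ df. Moreover, Theorem 7.3.2 shows that

\[ \frac{\bar{Y}-\mu}{\sigma/\sqrt{n}} \quad \text{and} \quad \frac{(n-1)S^2}{\sigma^2} \]

are independent. The statement of the theorem follows immediately, then, from
Definition 7.3.3.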
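The practical consequence of Theorem 7.3.5 shows up clearly in a simulation along the lines of Figures 7.2.1a and 7.2.1b. The sketch below assumes samples of size n = 4 from a normal distribution with $\mu$ = 10 and $\sigma$ = 2 (arbitrary values), a fixed seed, and 40,000 replications; the function names are hypothetical. It uses the cutoff 3.1825, the tabulated $t_{.025,3}$ from the t table excerpt in Section 7.4, so about 5% of the t ratios should fall beyond it in absolute value, while far fewer Z ratios should.

```python
import math
import random

def simulate_ratios(n, mu, sigma, reps, seed=7):
    """Return lists of Z ratios (sigma known) and t ratios (sigma replaced by S)."""
    rng = random.Random(seed)
    z_ratios, t_ratios = [], []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = sum(sample) / n
        s = math.sqrt(sum((y - ybar) ** 2 for y in sample) / (n - 1))
        z_ratios.append((ybar - mu) / (sigma / math.sqrt(n)))
        t_ratios.append((ybar - mu) / (s / math.sqrt(n)))
    return z_ratios, t_ratios

def tail_fraction(values, cutoff):
    """Fraction of values whose absolute value exceeds the cutoff."""
    return sum(abs(v) > cutoff for v in values) / len(values)

z_ratios, t_ratios = simulate_ratios(n=4, mu=10.0, sigma=2.0, reps=40000)

cutoff = 3.1825   # tabulated t_{.025,3}
frac_t = tail_fraction(t_ratios, cutoff)
frac_z = tail_fraction(z_ratios, cutoff)

print(f"fraction of |t ratios| beyond {cutoff}: {frac_t:.4f} (theory 0.05)")
print(f"fraction of |Z ratios| beyond {cutoff}: {frac_z:.4f}")
```

The heavier tails of the t ratios are exactly the discrepancy Gossett noticed when the standard normal curve was used in place of the correct pdf.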


$f_{T_n}(t)$ and $f_Z(z)$: How the Two Pdfs Are Related


Despite the considerable disparity in the appearance of the formulas for $f_{T_n}(t)$ and
$f_Z(z)$, Student t distributions and the standard normal distribution have much in
common. Both are bell shaped, symmetric, and centered around zero. Student t
curves, though, are flatter.

Figure 7.3.2 is a graph of two Student t distributions, one with 2 df and the
other with 10 df. Also pictured is the standard normal pdf, $f_Z(z)$. Notice that as n
increases, $f_{T_n}(t)$ becomes more and more like $f_Z(z)$.
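The convergence visible in Figure 7.3.2 can also be seen numerically. The sketch below measures the largest gap between $f_{T_n}(t)$ and $f_Z(z)$ over a grid from -5 to 5; the grid, the chosen values of n, and the function names are illustrative assumptions.

```python
import math

def t_pdf(t, n):
    """Student t pdf with n degrees of freedom (Theorem 7.3.4)."""
    log_const = (math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
                 - 0.5 * math.log(n * math.pi))
    return math.exp(log_const - ((n + 1) / 2) * math.log(1 + t * t / n))

def normal_pdf(z):
    """Standard normal pdf."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def max_gap(n, grid):
    """Largest absolute difference between the two pdfs over the grid."""
    return max(abs(t_pdf(x, n) - normal_pdf(x)) for x in grid)

grid = [i / 10 for i in range(-50, 51)]          # -5.0, -4.9, ..., 5.0
gaps = {n: max_gap(n, grid) for n in (2, 10, 100, 1000)}

for n, gap in gaps.items():
    print(f"n = {n:4d}: max |f_T(t) - f_Z(t)| = {gap:.6f}")
```

The gaps should shrink steadily as n grows, in line with the statement that the Student t curve approaches the standard normal curve.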


The convergence of $f_{T_n}(t)$ to $f_Z(z)$ is a consequence of two estimation
properties:


1. The sample standard deviation is asymptotically unbiased for $\sigma$.
2. The standard deviation of S goes to 0 as n approaches $\infty$. (See Question 7.3.4.)


Therefore as n gets large, the probabilistic behavior of $\frac{\bar{Y}-\mu}{S/\sqrt{n}}$ will become increasingly
similar to the distribution of $\frac{\bar{Y}-\mu}{\sigma/\sqrt{n}}$, that is, to $f_Z(z)$.

Questions


7.3.1. Show directly, without appealing to the fact that
$\chi^2_n$ is a gamma random variable, that $f_U(u)$ as stated in
Definition 7.3.1 is a true probability density function.


7.3.2. Find the moment-generating function for a chi
square random variable and use it to show that $E(\chi^2_n) = n$
and $\text{Var}(\chi^2_n) = 2n$.


7.3.3. Is it believable that the numbers 65, 30, and 55 are
a random sample of size 3 from a normal distribution with
$\mu = 50$ and $\sigma = 10$? Answer the question by using a chi
square distribution. [Hint: Let $Z_i = (Y_i - 50)/10$ and use
Theorem 7.3.1.]


7.3.4. Use the fact that $(n-1)S^2/\sigma^2$ is a chi square
random variable with $n - 1$ df to prove that

\[ \text{Var}(S^2) = \frac{2\sigma^4}{n-1} \]

(Hint: Use the fact that the variance of a chi square
random variable with k df is 2k.)


7.3.5. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a normal
distribution. Use the statement of Question 7.3.4 to
prove that $S^2$ is consistent for $\sigma^2$.


7.3.6. If Y is a chi square random variable with n degrees
of freedom, the pdf of $(Y - n)/\sqrt{2n}$ converges to $f_Z(z)$ as n
goes to infinity (recall Question 7.3.2). Use the asymptotic
normality of $(Y - n)/\sqrt{2n}$ to approximate the fortieth percentile
of a chi square random variable with 200 degrees
of freedom.


7.3.7. Use Appendix Table A.4 to find

(a) $F_{.50,6,7}$
(b) $F_{.001,15,5}$
(c) $F_{.90,2,2}$


7.3.8. Let V and U be independent chi square random
variables with 7 and 9 degrees of freedom, respectively. Is
it more likely that $\frac{V/7}{U/9}$ will be between (1) 2.51 and 3.29 or
(2) 3.29 and 4.20?

7.3.9. Use Appendix Table A.4 to find the values of x that 
satisfy the following equations: 


(c) P(F,x >5.35)=0.01 


(d) P(0.115 < F;,, <3.29) =0.90 
(e) $P\left(x < \frac{V/2}{U/3}\right) = 0.25$, where V is a chi square random
variable with 2 df and U is an independent chi square
random variable with 3 df.

7.3.10. Suppose that two independent samples of size n
are drawn from a normal distribution with variance $\sigma^2$.
Let $S_1^2$ and $S_2^2$ denote the two sample variances. Use the
fact that $\frac{(n-1)S^2}{\sigma^2}$ has a chi square distribution with $n - 1$ df
to explain why

\[ \lim_{m \to \infty,\; n \to \infty} F_{m,n} = 1 \]

7.3.11. If the random variable F has an F distribution 
with m and n degrees of freedom, show that 1/F has an 
F distribution with n and m degrees of freedom. 


7.3.12. Use the result claimed in Question 7.3.11 to
express percentiles of $f_{F_{n,m}}(r)$ in terms of percentiles from
$f_{F_{m,n}}(r)$. That is, if we know the values a and b for which
$P(a \leq F_{m,n} \leq b) = q$, what values of c and d will satisfy the
equation $P(c \leq F_{n,m} \leq d) = q$? “Check” your answer with
Appendix Table A.4 by comparing the values of $F_{.05,2,8}$,
$F_{.95,2,8}$, $F_{.05,8,2}$, and $F_{.95,8,2}$.


7.3.13. Show that as $n \to \infty$, the pdf of a Student t random
variable with n df converges to $f_Z(z)$. (Hint: To show that
the constant term in the pdf for $T_n$ converges to $1/\sqrt{2\pi}$,
use Stirling’s formula,

\[ n! \doteq \sqrt{2\pi n}\; n^n e^{-n} \]

Also, recall that $\lim_{n \to \infty} \left(1 + \frac{a}{n}\right)^n = e^a$.)


7.3.14. Evaluate the integral

\[ \int_0^\infty \frac{1}{1 + x^2}\, dx \]

using the Student t distribution.


7.3.15. For a Student t random variable Y with n degrees
of freedom and any positive integer k, show that $E(Y^{2k})$
exists if 2k < n. (Hint: Integrals of the form

\[ \int_0^\infty \frac{1}{(1 + y^\alpha)^\beta}\, dy \]

are finite if $\alpha > 0$, $\beta > 0$, and $\alpha\beta > 1$.)


7.4 Drawing Inferences About $\mu$


One of the most common of all statistical objectives is to draw inferences about the
mean of the population being represented by a set of data. Indeed, we already took
a first look at that problem in Section 6.2. If the $Y_i$’s come from a normal distribution
where $\sigma$ is known, the null hypothesis $H_0: \mu = \mu_0$ can be tested by calculating a Z
ratio, $\frac{\bar{Y}-\mu_0}{\sigma/\sqrt{n}}$ (recall Theorem 6.2.1).


Implicit in that solution, though, is an assumption not likely to be satisfied:
rarely does the experimenter actually know the value of $\sigma$. Section 7.3 dealt
with precisely that scenario and derived the pdf of the ratio $T_{n-1} = \frac{\bar{Y}-\mu}{S/\sqrt{n}}$, where
$\sigma$ has been replaced by an estimator, S. Given $T_{n-1}$ (which we learned has a
Student t distribution with $n - 1$ degrees of freedom), we now have the tools necessary
to draw inferences about $\mu$ in the all-important case where $\sigma$ is not known.
Section 7.4 illustrates these various techniques and also examines the key assumption
underlying the “t test” and looks at what happens when that assumption is not
satisfied.


t Tables 


We have already seen that doing hypothesis tests and constructing confidence intervals
using $\frac{\bar{Y}-\mu}{\sigma/\sqrt{n}}$ or some other Z ratio requires that we know certain upper and/or
lower percentiles from the standard normal distribution. There will be a similar need
to identify appropriate “cutoffs” from Student t distributions when the inference
procedure is based on $\frac{\bar{Y}-\mu}{S/\sqrt{n}}$ or some other t ratio.

Figure 7.4.1 shows a portion of the t table that appears in the back of every
statistics book. Each row corresponds to a different Student t pdf. The column
headings give the area to the right of the number appearing in the body of the
table.

                               α
df    .20     .15     .10     .05      .025     .01      .005

1     1.376   1.963   3.078   6.3138   12.706   31.821   63.657
2     1.061   1.386   1.886   2.9200   4.3027   6.965    9.9248
3     0.978   1.250   1.638   2.3534   3.1825   4.541    5.8409
4     0.941   1.190   1.533   2.1318   2.7764   3.747    4.6041
5     0.920   1.156   1.476   2.0150   2.5706   3.365    4.0321
6     0.906   1.134   1.440   1.9432   2.4469   3.143    3.7074

Figure 7.4.1

For example, the entry 4.541 listed in the $\alpha = .01$ column and the df = 3 row has
the property that $P(T_3 \geq 4.541) = 0.01$.

More generally, we will use the symbol $t_{\alpha,n}$ to denote the $100(1 - \alpha)$th percentile
of $f_{T_n}(t)$. That is, $P(T_n \geq t_{\alpha,n}) = \alpha$ (see Figure 7.4.2). No lower percentiles of Student
t curves need to be tabulated because the symmetry of $f_{T_n}(t)$ implies that
$P(T_n \leq -t_{\alpha,n}) = \alpha$.
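Entries in a t table can be reproduced, at least approximately, by integrating the pdf of Theorem 7.3.4. The sketch below is one way to do that, assuming Simpson's rule with an arbitrary step count and a hypothetical function name `t_upper_tail`; it uses the symmetry of $f_{T_n}(t)$ to write the right-tail area as one half minus the area between 0 and the cutoff. The four (df, cutoff, area) triples are taken from the table excerpt in Figure 7.4.1.

```python
import math

def t_pdf(t, n):
    """Student t pdf with n degrees of freedom (Theorem 7.3.4)."""
    log_const = (math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
                 - 0.5 * math.log(n * math.pi))
    return math.exp(log_const - ((n + 1) / 2) * math.log(1 + t * t / n))

def t_upper_tail(x, n, steps=20000):
    """Approximate P(T_n >= x) for x > 0 as 1/2 minus the area over [0, x]."""
    h = x / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, n) + t_pdf(x, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, n)
    return 0.5 - total * h / 3

# (df, tabulated cutoff, column heading) from the Figure 7.4.1 excerpt
table_entries = [(3, 4.541, 0.01), (5, 2.0150, 0.05), (1, 12.706, 0.025), (6, 1.134, 0.15)]
results = [(df, cutoff, alpha, t_upper_tail(cutoff, df))
           for df, cutoff, alpha in table_entries]

for df, cutoff, alpha, area in results:
    print(f"df = {df}: P(T >= {cutoff}) is about {area:.4f} (table column {alpha})")
```

Small differences between the computed areas and the column headings are to be expected, because the tabulated cutoffs are rounded.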

The number of different Student t pdfs summarized in a t table varies considerably.
Many tables will provide cutoffs for degrees of freedom ranging only from 1
to 30; others will include df values from 1 to 50, or even from 1 to 100. The last row
in any t table, though, is always labeled “$\infty$”: Those entries, of course, correspond
to $z_\alpha$.

[Figure 7.4.2: The Student t pdf $f_{T_n}(t)$, with area $\alpha = P(T_n \geq t_{\alpha,n})$ to the right of $t_{\alpha,n}$]


Constructing a Confidence Interval for $\mu$


The fact that $\frac{\bar{Y}-\mu}{S/\sqrt{n}}$ has a Student t distribution with $n - 1$ degrees of freedom justifies
the statement that

\[ P\left(-t_{\alpha/2,n-1} \leq \frac{\bar{Y}-\mu}{S/\sqrt{n}} \leq t_{\alpha/2,n-1}\right) = 1 - \alpha \]

or, equivalently, that

\[ P\left(\bar{Y} - t_{\alpha/2,n-1} \cdot \frac{S}{\sqrt{n}} \leq \mu \leq \bar{Y} + t_{\alpha/2,n-1} \cdot \frac{S}{\sqrt{n}}\right) = 1 - \alpha \qquad (7.4.1) \]

(provided the $Y_i$’s are a random sample from a normal distribution).
When the actual data values are then used to evaluate $\bar{Y}$ and S, the lower
and upper endpoints identified in Equation 7.4.1 define a $100(1 - \alpha)$% confidence
interval for $\mu$.


Theorem 7.4.1. Let $y_1, y_2, \ldots, y_n$ be a random sample of size n from a normal distribution with
(unknown) mean $\mu$. A $100(1 - \alpha)$% confidence interval for $\mu$ is the set of values

\[ \left(\bar{y} - t_{\alpha/2,n-1} \cdot \frac{s}{\sqrt{n}},\;\; \bar{y} + t_{\alpha/2,n-1} \cdot \frac{s}{\sqrt{n}}\right) \]

Case Study 7.4.1 


To hunt flying insects, bats emit high-frequency sounds and then listen for their 
echoes. Until an insect is located, these pulses are emitted at intervals of from 
fifty to one hundred milliseconds. When an insect is detected, the pulse-to-pulse 
interval suddenly decreases—sometimes to as low as ten milliseconds—thus 
enabling the bat to pinpoint its prey’s position. This raises an interesting ques- 
tion: How far apart are the bat and the insect when the bat first senses that 
the insect is there? Or, put another way, what is the effective range of a bat’s 
echolocation system? 

The technical problems that had to be overcome in measuring the bat-to-insect
detection distance were far more complex than the statistical problems
involved in analyzing the actual data. The procedure that finally evolved was
to put a bat into an eleven-by-sixteen-foot room, along with an ample supply
of fruit flies, and record the action with two synchronized sixteen-millimeter
sound-on-film cameras. By examining the two sets of pictures frame by frame,
scientists could follow the bat’s flight pattern and, at the same time, monitor its
pulse frequency. For each insect that was caught (65), it was therefore possible
to estimate the distance between the bat and the insect at the precise moment
the bat’s pulse-to-pulse interval decreased (see Table 7.4.1).


Table 7.4.1

Catch Number    Detection Distance (cm)

[Table entries illegible in this copy: eleven catches, numbered 1 through 11, each with a detection distance in cm]


Define $\mu$ to be a bat’s true average detection distance. Use the eleven
observations in Table 7.4.1 to construct a 95% confidence interval for $\mu$.
Letting $y_1 = 62, y_2 = 52, \ldots, y_{11} = 40$, we have that

\[ \sum_{i=1}^{11} y_i = 532 \quad \text{and} \quad \sum_{i=1}^{11} y_i^2 = 29{,}000 \]

Therefore,

\[ \bar{y} = \frac{532}{11} = 48.4 \text{ cm} \]

and

\[ s = \sqrt{\frac{11(29{,}000) - (532)^2}{11(10)}} = 18.1 \text{ cm} \]

If the population from which the $y_i$’s are being drawn is normal, the
behavior of

\[ \frac{\bar{Y}-\mu}{S/\sqrt{n}} \]

will be described by a Student t curve with 10 degrees of freedom. From
Table A.2 in the Appendix,

\[ P(-2.2281 < T_{10} < 2.2281) = 0.95 \]
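The arithmetic for this interval can be sketched in a few lines. The function below is a hypothetical helper, not something from the text: it takes the sample size, the two sums given above, and a t cutoff that the user looks up in a table (here 2.2281 for 10 df), and applies the formula of Theorem 7.4.1. Because it carries full precision rather than the rounded values 48.4 and 18.1, its endpoints may differ slightly in the last digit from a hand calculation.

```python
import math

def t_confidence_interval(n, sum_y, sum_y_sq, t_cutoff):
    """Return (ybar, s, (lower, upper)) from summary sums and a tabulated cutoff."""
    ybar = sum_y / n
    # Computing formula for the sample standard deviation
    s = math.sqrt((n * sum_y_sq - sum_y ** 2) / (n * (n - 1)))
    half_width = t_cutoff * s / math.sqrt(n)
    return ybar, s, (ybar - half_width, ybar + half_width)

# Summary values from Case Study 7.4.1; 2.2281 is the tabulated t_{.025,10}
ybar, s, (lower, upper) = t_confidence_interval(n=11, sum_y=532, sum_y_sq=29000,
                                                t_cutoff=2.2281)

print(f"ybar = {ybar:.1f} cm, s = {s:.1f} cm")
print(f"95% confidence interval: ({lower:.1f} cm, {upper:.1f} cm)")
```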


Accordingly, the 95% confidence interval for $\mu$ is

\[ \left(\bar{y} - 2.2281\left(\frac{s}{\sqrt{11}}\right),\;\; \bar{y} + 2.2281\left(\frac{s}{\sqrt{11}}\right)\right) \]

\[ = \left(48.4 - 2.2281\left(\frac{18.1}{\sqrt{11}}\right),\;\; 48.4 + 2.2281\left(\frac{18.1}{\sqrt{11}}\right)\right) \]

\[ = (36.2 \text{ cm},\; 60.6 \text{ cm}). \]


Example 7.4.1. The sample mean and sample standard deviation for the random sample of size
n = 20 given in the following list are 2.6 and 3.6, respectively. Let $\mu$ denote the true
mean of the distribution being represented by these $y_i$’s.


Is it correct to say that a 95% confidence interval for $\mu$ is the set of following values?

\[ \left(\bar{y} - t_{.025,n-1} \cdot \frac{s}{\sqrt{n}},\;\; \bar{y} + t_{.025,n-1} \cdot \frac{s}{\sqrt{n}}\right) \]

\[ = \left(2.6 - 2.0930 \cdot \frac{3.6}{\sqrt{20}},\;\; 2.6 + 2.0930 \cdot \frac{3.6}{\sqrt{20}}\right) \]

\[ = (0.9, 4.3) \]

No. It is true that all the correct factors have been used in calculating (0.9, 4.3),
but Theorem 7.4.1 does not apply in this case because the normality assumption
it makes is clearly being violated. Figure 7.4.3 is a histogram of the twenty $y_i$’s. The
extreme skewness that is so evident there is not consistent with the presumption that
the data’s underlying pdf is a normal distribution. As a result, the pdf describing the
probabilistic behavior of $\frac{\bar{Y}-\mu}{S/\sqrt{20}}$ would not be $f_{T_{19}}(t)$.


[Figure 7.4.3: Histogram of the twenty $y_i$’s; horizontal axis y, from 0 to 10; vertical axis frequency]

Comment To say that $\frac{\bar{Y}-\mu}{S/\sqrt{20}}$ in this situation is not exactly a $T_{19}$ random variable
leaves unanswered a critical question: Is the ratio approximately a $T_{19}$ random
variable? We will revisit the normality assumption, and what happens when
that assumption is not satisfied, later in this section when we discuss a critically
important property known as robustness.


Questions 


7.4.1. Use Appendix Table A.2 to find the following
probabilities:


(a) $P(T_6 \geq 1.134)$

(b) $P(T_{15} \leq 0.866)$

(c) $P(T_3 \geq -1.250)$

(d) $P(-1.055 < T_{29} < 2.462)$


7.4.2. What values of x satisfy the following equations? 


(a) P(-x < Tx es x) = 0.98 
(c) P (To < x) = 0.95 
(d) P(T;> x) =0.025 


7.4.3. Which of the following differences is larger?
Explain.


$t_{.05,n} - t_{.10,n}$ or $t_{.10,n} - t_{.15,n}$


7.4.4. A random sample of size n = 9 is drawn from a normal
distribution with $\mu = 27.6$. Within what interval (−a,
+a) can we expect to find $\frac{\bar{Y} - 27.6}{S/\sqrt{9}}$ 80% of the time? 90% of
the time?


7.4.5. Suppose a random sample of size n = 11 is drawn
from a normal distribution with $\mu = 15.0$. For what value
of k is the following true?

\[ P\left(\left|\frac{\bar{Y} - 15.0}{S/\sqrt{11}}\right| \geq k\right) = 0.05 \]


7.4.6. Let $\bar{Y}$ and S denote the sample mean and sample
standard deviation, respectively, based on a set of
n = 20 measurements taken from a normal distribution
with $\mu = 90.6$. Find the function k(S) for which

\[ P[90.6 - k(S) \leq \bar{Y} \leq 90.6 + k(S)] = 0.99 \]


7.4.7. Cell phones emit radio frequency energy that is 
absorbed by the body when the phone is next to the ear 
and may be harmful. The table in the next column gives 
the absorption rate for a random sample of twenty cell 
phones. (The Federal Communication Commission sets a 
maximum of 1.6 watts per kilogram for the absorption rate 
of such energy.) Construct a 90% confidence interval for 
the true average cell phone absorption rate. 


0.87 0.72 
1.30 1.05 
0.79 0.61 
1.45 1.01 
1.15 0.20 
1.31 0.67 
1.09 1.35 
0.66 1.27 
0.49 1.28 
1.40 1.55 


Source: reviews.cnet.com/cell-phone-radiation-levels/ 


7.4.8. The following table lists the typical cost of repairing 
the bumper of a moderately priced midsize car damaged 
by a corner collision at 3 mph. Use these observations 
to construct a 95% confidence interval for $\mu$, the true
average repair cost for all such automobiles with similar 
damage. The sample standard deviation for these data is 
s = $369.02. 


Repair Repair 

Make/Model Cost Make/Model Cost 

Hyundai Sonata $1019 Honda Accord $1461 
Nissan Altima $1090 Volkswagen Jetta $1525 
Mitsubishi Galant $1109 Toyota Camry $1670 
Saturn AURA $1235 Chevrolet Malibu $1685 
Subaru Legacy $1275 Volkswagen Passat $1783 
Pontiac G6 $1361 Nissan Maxima $1787 
Mazda 6 $1437 Ford Fusion $1889 
Volvo S40 $1446 Chrysler Sebring $2484 


Source: www.iihs.org/ratings/bumpersbycategory.aspx? 


7.4.9. Creativity, as any number of studies have shown, is 
very much a province of the young. Whether the focus is 
music, literature, science, or mathematics, an individual’s 
best work seldom occurs late in life. Einstein, for example, 
made his most profound discoveries at the age of twenty- 
six; Newton, at the age of twenty-three. The following are 
twelve scientific breakthroughs dating from the middle of 
the sixteenth century to the early years of the twentieth 
century (205). All represented high-water marks in the 
careers of the scientists involved. 




Discovery                                              Discoverer     Year    Age, y

Earth goes around sun                                  Copernicus     1543    40
Telescope, basic laws of astronomy                     Galileo        1600    34
Principles of motion, gravitation, calculus            Newton         1665    23
Nature of electricity                                  Franklin       1746    40
Burning is uniting with oxygen                         Lavoisier      1774    31
Earth evolved by gradual processes                     Lyell          1830    33
Evidence for natural selection controlling evolution   Darwin         1858    49
Field equations for light                              Maxwell        1864    33
Radioactivity                                          Curie          1896    34
Quantum theory                                         Planck         1901    43
Special theory of relativity, E = mc^2                 Einstein       1905    26
Mathematical foundations for quantum theory            Schrodinger    1926


(a) What can be inferred from these data about the
true average age at which scientists do their best
work? Answer the question by constructing a 95%
confidence interval.

(b) Before constructing a confidence interval for a set of
observations extending over a long period of time,
we should be convinced that the $y_i$’s exhibit no biases
or trends. If, for example, the age at which scientists
made major discoveries decreased from century
to century, then the parameter $\mu$ would no longer
be a constant, and the confidence interval would
be meaningless. Plot “date” versus “age” for these
twelve discoveries. Put “date” on the abscissa. Does
the variability in the $y_i$’s appear to be random with
respect to time?


7.4.10. How long does it take to fly from Atlanta to New 
York’s LaGuardia airport? There are many components 
of the time elapsed, but one of the more stable measure- 
ments is the actual in-air time. For a sample of sixty-one 
flights between these destinations on Sundays in April, the 
time in minutes (y) gave the following results: 


\[ \sum_{i=1}^{61} y_i = 6450 \quad \text{and} \quad \sum_{i=1}^{61} y_i^2 = 684{,}900 \]


Find a 99% confidence interval for the average flight time. 


Source: www.bts.gov/xml/ontimesummarystatistics/src/ 
dstat/OntimeSummaryDepaturesData.xml. 


7.4.11. In a nongeriatric population, platelet counts
ranging from 140 to 440 (thousands per mm$^3$ of blood)


are considered “normal.” The following are the platelet 
counts recorded for twenty-four female nursing home 
residents (169). 


Subject Count Subject Count 
1 125 13 180 
2 170 14 180 
3 250 15 280 
4 270 16 240 
5 144 17 270 
6 184 18 220 
7 176 19 110 
8 100 20 176 
9 220 21 280 
10 200 22 176 
11 170 23 188 
12 160 24 176 
Use the following sums:

\[ \sum_{i=1}^{24} y_i = 4645 \quad \text{and} \quad \sum_{i=1}^{24} y_i^2 = 959{,}265 \]

How does the definition of “normal” above compare with 
the 90% confidence interval? 


7.4.12. If a normally distributed sample of size n = 16 produces
a 95% confidence interval for $\mu$ that ranges from
44.7 to 49.9, what are the values of $\bar{y}$ and s?


7.4.13. Two samples, each of size n, are taken from a
normal distribution with unknown mean $\mu$ and unknown
standard deviation $\sigma$. A 90% confidence interval for $\mu$ is
constructed with the first sample, and a 95% confidence
interval for $\mu$ is constructed with the second. Will the 95%
confidence interval necessarily be longer than the 90%
confidence interval? Explain.


7.4.14. Revenues reported last week from nine boutiques 
franchised by an international clothier averaged $59,540 
with a standard deviation of $6860. Based on those figures, 
in what range might the company expect to find the 
average revenue of all of its boutiques? 


7.4.15. What “confidence” is associated with each of the 
following random intervals? Assume that the Y,’s are 
normally distributed. 


(a) |7 -2.0530( +) .7-+2.0530(—5-) | 
(b) | 1345(—5) 74 1345(—+.)| 
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7.4.16. The weather station at Dismal Swamp, 


Rainfall in inches Frequency 
California, recorded monthly peeaptanon (y) for ea 


eight years. For these data, = = 1392.6 and Ss i ie 
10, 518.84. 2-3 35 
3-4 41 

4-5 28 

(a) Find the 95% confidence interval for the mean 5-6 24 
monthly precipitation. 6-7 18 

(b) The table on the right gives a frequency ditri- 78 16 
bution for the Dismal Swamp precipitation data. 8-9 16 
Does this distribution raise questions about using 9-10 5 
Theorem 7.4.1? 10-11 9 
11-12 21 


Source: www.wcc.nrcs.usda -ZOv. 


Testing $H_0: \mu = \mu_0$ (The One-Sample t Test)


Suppose a normally distributed random sample of size n is observed for the purpose
of testing the null hypothesis that $\mu = \mu_0$. If $\sigma$ is unknown, which is usually the
case, the procedure we use is called a one-sample t test. Conceptually, the latter is
much like the Z test of Theorem 6.2.1, except that the decision rule is defined in
terms of $t = \frac{\bar{y}-\mu_0}{s/\sqrt{n}}$ rather than $z = \frac{\bar{y}-\mu_0}{\sigma/\sqrt{n}}$ [which requires that the critical values come
from $f_{T_{n-1}}(t)$ rather than $f_Z(z)$].

Theorem 7.4.2. Let $y_1, y_2, \ldots, y_n$ be a random sample of size n from a normal distribution where $\sigma$ is
unknown. Let $t = \frac{\bar{y}-\mu_0}{s/\sqrt{n}}$.


a. To test $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$ at the $\alpha$ level of significance, reject $H_0$ if
$t \geq t_{\alpha,n-1}$.

b. To test $H_0: \mu = \mu_0$ versus $H_1: \mu < \mu_0$ at the $\alpha$ level of significance, reject $H_0$ if
$t \leq -t_{\alpha,n-1}$.

c. To test $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$ at the $\alpha$ level of significance, reject $H_0$ if t is
either (1) $\leq -t_{\alpha/2,n-1}$ or (2) $\geq t_{\alpha/2,n-1}$.


Proof Appendix 7.A.3 gives the complete derivation that justifies using the procedure
described in Theorem 7.4.2. In short, the test statistic $t = \frac{\bar{y}-\mu_0}{s/\sqrt{n}}$ is a monotonic
function of the $\lambda$ that appears in Definition 6.5.2, which makes the one-sample t test
a GLRT.


Case Study 7.4.2 


Not all rectangles are created equal. Since antiquity, societies have expressed 
aesthetic preferences for rectangles having certain width (w) to length (/) ratios. 

One “standard” calls for the width-to-length ratio to be equal to the ratio
of the length to the sum of the width and the length. That is,

\[ \frac{w}{l} = \frac{l}{w + l} \qquad (7.4.2) \]

Equation 7.4.2 implies that the width is $\frac{1}{2}(\sqrt{5} - 1)$, or approximately 0.618, times
as long as the length. The Greeks called this the golden rectangle and used it
often in their architecture (see Figure 7.4.4). Many other cultures were similarly
inclined. The Egyptians, for example, built their pyramids out of stones whose
faces were golden rectangles. Today in our society, the golden rectangle remains
an architectural and artistic standard, and even items such as driver’s licenses,
business cards, and picture frames often have w/l ratios close to 0.618.

[Figure 7.4.4: A golden rectangle, with width w and length l ($w/l \approx 0.618$)]

The fact that many societies have embraced the golden rectangle as an aes- 
thetic standard has two possible explanations. One, they “learned” to like it 
because of the profound influence that Greek writers, philosophers, and artists 
have had on cultures all over the world. Or two, there is something unique about 
human perception that predisposes a preference for the golden rectangle. 

Researchers in the field of experimental aesthetics have tried to test the 
plausibility of those two hypotheses by seeing whether the golden rectangle is 
accorded any special status by societies that had no contact whatsoever with 
the Greeks or with their legacy. One such study (37) examined the w/l ratios
of beaded rectangles sewn by the Shoshoni Indians as decorations on their 
blankets and clothes. Table 7.4.2 lists the ratios found for twenty such rectangles. 

If, indeed, the Shoshonis also had a preference for golden rectangles, we 
would expect their ratios to be “close” to 0.618. The average value of the entries
in Table 7.4.2, though, is 0.661. What does that imply? Is 0.661 close enough
to 0.618 to support the position that liking the golden rectangle is a human 
characteristic, or is 0.661 so far from 0.618 that the only prudent conclusion is 
that the Shoshonis did not agree with the aesthetics espoused by the Greeks? 


Table 7.4.2 Width-to-Length Ratios of Shoshoni Rectangles 
0.693 0.749 0.654 0.670 
0.662 0.672 0.615 0.606 
0.690 0.628 0.668 0.611 
0.606 0.609 0.601 0.553 
0.570 0.844 0.576 0.933 




Example 
7.4.2 




Let $\mu$ denote the true average width-to-length ratio of Shoshoni rectangles.
The hypotheses to be tested are

$H_0: \mu = 0.618$
versus
$H_1: \mu \neq 0.618$

For tests of this nature, the value of $\alpha = 0.05$ is often used. For that value of
$\alpha$ and a two-sided test, the critical values, using part (c) of Theorem 7.4.2 and
Appendix Table A.2, are $t_{.025,19} = 2.0930$ and $-t_{.025,19} = -2.0930$.

The data in Table 7.4.2 have $\bar{y} = 0.661$ and $s = 0.093$. Substituting these
values into the t ratio gives a test statistic that lies just inside of the interval
between -2.0930 and 2.0930:

$$t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}} = \frac{0.661 - 0.618}{0.093/\sqrt{20}} = 2.068$$

Thus, these data do not rule out the possibility that the Shoshoni Indians also
embraced the golden rectangle as an aesthetic standard.
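The calculation in Case Study 7.4.2 can be retraced with a short script. The following is a minimal sketch in Python (standard library only), not part of the original analysis; the function name `one_sample_t` is an illustrative choice, and the critical value 2.0930 is simply copied from the case study rather than computed.

```python
import math
import statistics

# Width-to-length ratios from Table 7.4.2 (twenty Shoshoni rectangles).
ratios = [
    0.693, 0.749, 0.654, 0.670,
    0.662, 0.672, 0.615, 0.606,
    0.690, 0.628, 0.668, 0.611,
    0.606, 0.609, 0.601, 0.553,
    0.570, 0.844, 0.576, 0.933,
]

def one_sample_t(data, mu0):
    """Return the t ratio (ybar - mu0) / (s / sqrt(n)) for a one-sample t test."""
    n = len(data)
    ybar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)
    return (ybar - mu0) / (s / math.sqrt(n))

MU0 = 0.618      # golden-rectangle ratio under H0
T_CRIT = 2.0930  # t_{.025,19}, copied from the case study (Appendix Table A.2)

t = one_sample_t(ratios, MU0)
reject = abs(t) >= T_CRIT  # two-sided rule, part (c) of Theorem 7.4.2
print(f"t = {t:.3f}, reject H0: {reject}")
```

Because the script uses the unrounded sample mean and standard deviation, its t ratio may differ slightly from the 2.068 obtained with the rounded values 0.661 and 0.093; the decision is the same either way.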


About the Data Like $\pi$ and $e$, the ratio w/l for golden rectangles (more commonly
referred to as either phi or the golden ratio) is an irrational number with all sorts of
fascinating properties and connections.

Algebraically, the solution of the equation

$$\frac{w}{l} = \frac{l}{w + l}$$

is the continued fraction

$$\frac{w}{l} = \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1 + \cdots}}}$$

Among the curiosities associated with phi is its relationship with the Fibonacci series.
The latter, of course, is the famous sequence in which each term is the sum of its two
predecessors, that is,

1  1  2  3  5  8  13  21  34  55  89
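One well-known link between the two (offered here as background, since the passage above does not spell it out) is that the ratio of each Fibonacci term to its successor approaches the golden-rectangle value w/l = 0.618.... The sketch below, in Python, generates the eleven terms listed above and prints those ratios; the helper name `fibonacci` is an illustrative choice.

```python
def fibonacci(count):
    """Return the first `count` Fibonacci terms, starting 1, 1, 2, ..."""
    terms = [1, 1]
    while len(terms) < count:
        terms.append(terms[-1] + terms[-2])
    return terms[:count]

# Width-to-length ratio of a golden rectangle, (sqrt(5) - 1) / 2.
GOLDEN_W_OVER_L = (5 ** 0.5 - 1) / 2

fib = fibonacci(11)
# Ratio of each term to the next one: 1/1, 1/2, 2/3, 3/5, ...
term_ratios = [fib[i] / fib[i + 1] for i in range(len(fib) - 1)]

for small, large, ratio in zip(fib, fib[1:], term_ratios):
    print(f"{small:>2}/{large:<2} = {ratio:.6f}  "
          f"(difference from w/l: {ratio - GOLDEN_W_OVER_L:+.6f})")
```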


Three banks serve a metropolitan area’s inner-city neighborhoods: Federal Trust, 
American United, and Third Union. The state banking commission is concerned 
that loan applications from inner-city residents are not being accorded the same con- 
sideration that comparable requests have received from individuals in rural areas. 
Both constituencies claim to have anecdotal evidence suggesting that the other 
group is being given preferential treatment. 

Records show that last year these three banks approved 62% of all the home
mortgage applications filed by rural residents. Listed in Table 7.4.3 are the approval
rates posted over that same period by the twelve branch offices of Federal Trust
(FT), American United (AU), and Third Union (TU) that work primarily with the
inner-city community. Do these figures lend any credence to the contention that
the banks are treating inner-city residents and rural residents differently? Analyze
the data using an $\alpha = 0.05$ level of significance.

Table 7.4.3
Bank  Location             Affiliation  Percent Approved
1     3rd & Morgan         AU           59
2     Jefferson Pike       TU           65
3     East 150th & Clark   TU           69
4     Midway Mall          FT           53
5     N. Charter Highway   FT           60
6     Lewis & Abbot        AU           53
7     West 10th & Lorain   FT           58
8     Highway 70           FT           64
9     Parkway Northwest    AU           46
10    Lanier & Tower       TU           67
11    King & Tara Court    AU           51
12    Bluedot Corners      FT           59

As a starting point, we might want to test

$H_0: \mu = 62$
versus
$H_1: \mu \neq 62$

where $\mu$ is the true average approval rate for all inner-city banks. Table 7.4.4 summarizes
the analysis. The two critical values are $\pm t_{.025,11} = \pm 2.2010$, and the observed t
ratio is $-1.66$ $\left(= \frac{58.667 - 62}{6.946/\sqrt{12}}\right)$, so our decision is “Fail to reject $H_0$.”

Table 7.4.4
Banks  n   $\bar{y}$  s      t Ratio  Critical Value  Reject $H_0$?
All    12  58.667     6.946  -1.66    ±2.2010         No


About the Data The “overall” analysis of Table 7.4.4, though, may be too simplistic.
Common sense would tell us to look also at the three banks separately. What
emerges, then, is an entirely different picture (see Table 7.4.5). Now we can see
why both groups felt discriminated against: American United (t = -3.63) and Third
Union (t = +4.33) each had rates that differed significantly from 62%, but in opposite
directions! Only Federal Trust seems to be dealing with inner-city residents and
rural residents in an even-handed way.

Table 7.4.5
Banks            n  $\bar{y}$  s     t Ratio  Critical Value  Reject $H_0$?
American United  4  52.25      5.38  -3.63    ±3.1825         Yes
Federal Trust    5  58.80      3.96  -1.81    ±2.7764         No
Third Union      3  67.00      2.00  +4.33    ±4.3027         Yes
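The entries in Tables 7.4.4 and 7.4.5 can be reproduced from the raw approval rates in Table 7.4.3. What follows is a sketch in Python under stated assumptions: the critical values are copied from the two tables rather than computed, and the names `t_ratio`, `T_CRIT`, and `results` are illustrative.

```python
import math
import statistics

# (affiliation, percent approved) for the twelve branches in Table 7.4.3.
branches = [
    ("AU", 59), ("TU", 65), ("TU", 69), ("FT", 53),
    ("FT", 60), ("AU", 53), ("FT", 58), ("FT", 64),
    ("AU", 46), ("TU", 67), ("AU", 51), ("FT", 59),
]

MU0 = 62  # approval rate (%) for rural residents, the H0 value

# Two-sided alpha = 0.05 cutoffs t_{.025,df}, copied from Tables 7.4.4 and 7.4.5.
T_CRIT = {2: 4.3027, 3: 3.1825, 4: 2.7764, 11: 2.2010}

def t_ratio(data, mu0):
    """One-sample t ratio (ybar - mu0) / (s / sqrt(n))."""
    n = len(data)
    return (statistics.mean(data) - mu0) / (statistics.stdev(data) / math.sqrt(n))

groups = {"All": [pct for _, pct in branches]}
for bank, pct in branches:
    groups.setdefault(bank, []).append(pct)

results = {}
for label, data in groups.items():
    t = t_ratio(data, MU0)
    reject = abs(t) >= T_CRIT[len(data) - 1]
    results[label] = (t, reject)
    print(f"{label:>3}: n={len(data):2d}  ybar={statistics.mean(data):6.3f}  "
          f"s={statistics.stdev(data):5.3f}  t={t:+.2f}  reject H0: {reject}")
```

Grouping by affiliation before testing is what exposes the two opposite-signed departures from 62% that cancel each other in the pooled analysis.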


Questions 


7.4.17. Recall the Bacillus subtilis data in Question 5.3.2.
Test the null hypothesis that exposure to the enzyme does
not affect a worker’s respiratory capacity (as measured by
the FEV$_1$/VC ratio). Use a one-sided $H_1$ and let $\alpha = 0.05$.
Assume that $\sigma$ is not known.


7.4.18. Recall Case Study 5.3.1. Assess the credibility of
the theory that Etruscans were native Italians by testing
an appropriate $H_0$ against a two-sided $H_1$. Set $\alpha$
equal to 0.05. Use 143.8 mm and 6.0 mm for $\bar{y}$ and $s$,
respectively, and let $\mu_0 = 132.4$. Do these data appear to
satisfy the distribution assumption made by the t test?
Explain.


7.4.19. MBAs R Us advertises that its program increases 
a person’s score on the GMAT by an average of forty 
points. As a way of checking the validity of that claim, a 
consumer watchdog group hired fifteen students to take 
both the review course and the GMAT. Prior to starting 
the course, the fifteen students were given a diagnos- 
tic test that predicted how well they would do on the 
GMAT in the absence of any special training. The fol- 
lowing table gives each student’s actual GMAT score 
minus his or her predicted score. Set up and carry out 
an appropriate hypothesis test. Use the 0.05 level of 
significance. 


Subject   $y_i$ = act. GMAT - pre. GMAT   $y_i^2$
SA 35 1225 
LG 37 1369 
SH 33 1089 
KN 34 1156 
DF 38 1444 
SH 40 1600 
ML 35 1225 
JG 36 1296 
KH 38 1444 
HS 33 1089 
LL 28 784 
CE 34 1156 
KK 47 2209 
CW 42 1764 
DP 46 2116 


7.4.20. In addition to the Shoshoni data of Case
Study 7.4.2, a set of rectangles that might tend to the
golden ratio are national flags. The table below gives the
width-to-length ratios for a random sample of the flags of
thirty-four countries. Let $\mu$ be the width-to-length ratio
for national flags. At the $\alpha = 0.01$ level, test $H_0: \mu = 0.618$
versus $H_1: \mu \neq 0.618$.


             Ratio                            Ratio
Country      Width to Height   Country        Width to Height
Afghanistan  0.500             Iceland        0.720
Albania      0.714             Iran           0.571
Algeria      0.667             Israel         0.727
Angola       0.667             Laos           0.667
Argentina    0.667             Lebanon        0.667
Bahamas      0.500             Liberia        0.526
Denmark      0.757             Macedonia      0.500
Djibouti     0.553             Mexico         0.571
Ecuador      0.500             Monaco         0.800
Egypt        0.667             Namibia        0.667
El Salvador  0.600             Nepal          1.250
Estonia      0.667             Romania        0.667
Ethiopia     0.500             Rwanda         0.667
Gabon        0.750             South Africa   0.667
Fiji         0.500             St. Helena     0.500
France       0.667             Sweden         0.625
Honduras     0.500             United Kingdom 0.500


Source: http://www.anyflag.com/country/costaric.php. 


7.4.21. A manufacturer of pipe for laying underground
electrical cables is concerned about the pipe’s rate of
corrosion and whether a special coating may retard that
rate. As a way of measuring corrosion, the manufacturer
examines a short length of pipe and records the
depth of the maximum pit. The manufacturer’s tests have
shown that in a year’s time in the particular kind of soil
the manufacturer must deal with, the average depth of
the maximum pit in a foot of pipe is 0.0042 inch. To
see whether that average can be reduced, ten pipes are
coated with a new plastic and buried in the same soil.
After one year, the following maximum pit depths are
recorded (in inches): 0.0039, 0.0041, 0.0038, 0.0044, 0.0040,
0.0036, 0.0034, 0.0036, 0.0046, and 0.0036. Given that the
sample standard deviation for these ten measurements is
0.000383 inch, can it be concluded at the $\alpha = 0.05$ level of
significance that the plastic coating is beneficial?


7.4.22. The first analysis done in Example 7.4.2 (using all
n = 12 banks with $\bar{y} = 58.667$) failed to reject $H_0: \mu = 62$
at the $\alpha = 0.05$ level. Had $\mu_0$ been, say, 61.7 or 58.6, the
same conclusion would have been reached. What do we
call the entire set of $\mu_0$’s for which $H_0: \mu = \mu_0$ would not
be rejected at the $\alpha = 0.05$ level?


Testing $H_0: \mu = \mu_0$ When the Normality Assumption Is Not Met


Every t test makes the same explicit assumption, namely, that the set of n $y_i$’s is
normally distributed. But suppose the normality assumption is not true. What are
the consequences? Is the validity of the t test compromised?

Figure 7.4.5 addresses the first question. We know that if the normality assumption
is true, the pdf describing the variation of the t ratio, $\frac{\bar{Y} - \mu_0}{S/\sqrt{n}}$, is $f_{T_{n-1}}(t)$. The
latter, of course, provides the decision rule’s critical values. If $H_0: \mu = \mu_0$ is to be
tested against $H_1: \mu \neq \mu_0$, for example, the null hypothesis is rejected if t is either
(1) $\leq -t_{\alpha/2,n-1}$ or (2) $\geq t_{\alpha/2,n-1}$ (which makes the Type I error probability equal to $\alpha$).


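Figure 7.4.5 Two pdfs for the t ratio: $f_{T^*}(t)$ = pdf of t when data are not normally distributed; $f_{T_{n-1}}(t)$ = pdf of t when data are normally distributed. $H_0$ is rejected to the left of $-t_{\alpha/2,n-1}$ and to the right of $t_{\alpha/2,n-1}$; each tail area under $f_{T_{n-1}}(t)$ equals $\alpha/2$.


If the normality assumption is not true, the pdf of $\frac{\bar{Y} - \mu_0}{S/\sqrt{n}}$ will not be $f_{T_{n-1}}(t)$ and

$$P\left(\frac{\bar{Y} - \mu_0}{S/\sqrt{n}} \leq -t_{\alpha/2,n-1}\right) + P\left(\frac{\bar{Y} - \mu_0}{S/\sqrt{n}} \geq t_{\alpha/2,n-1}\right) \neq \alpha$$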


In effect, violating the normality assumption creates two $\alpha$’s: The “nominal” $\alpha$ is the
Type I error probability we specify at the outset (typically, 0.05 or 0.01). The “true”
$\alpha$ is the actual probability that $\frac{\bar{Y} - \mu_0}{S/\sqrt{n}}$ falls in the rejection region (when $H_0$ is true).
For the two-sided decision rule pictured in Figure 7.4.5,

$$\text{true } \alpha = \int_{-\infty}^{-t_{\alpha/2,n-1}} f_{T^*}(t)\,dt + \int_{t_{\alpha/2,n-1}}^{\infty} f_{T^*}(t)\,dt$$


Whether or not the validity of the t test is “compromised” by the normality
assumption being violated depends on the numerical difference between the two
$\alpha$’s. If $f_{T^*}(t)$ is, in fact, quite similar in shape and location to $f_{T_{n-1}}(t)$, then the true
$\alpha$ will be approximately equal to the nominal $\alpha$. In that case, the fact that the $y_i$’s are
not normally distributed would be essentially irrelevant. On the other hand, if $f_{T^*}(t)$
and $f_{T_{n-1}}(t)$ are dramatically different (as they appear to be in Figure 7.4.5), it would
follow that the normality assumption is critical, and establishing the “significance”
of a t ratio becomes problematic.


Unfortunately, getting an exact expression for $f_{T^*}(t)$ is essentially impossible,
because the distribution depends on the pdf being sampled, and there is seldom any
way of knowing precisely what that pdf might be. However, we can still meaningfully
explore the sensitivity of the t ratio to violations of the normality assumption by
simulating samples of size n from selected distributions and comparing the resulting
histogram of t ratios to $f_{T_{n-1}}(t)$.

Figure 7.4.6 shows four such simulations, using Minitab; the first three consist
of one hundred random samples of size n = 6. In Figure 7.4.6(a), the samples come
from a uniform pdf defined over the interval [0, 1]; in Figure 7.4.6(b), the underlying
pdf is the exponential with $\lambda = 1$; and in Figure 7.4.6(c), the data are coming from a
Poisson pdf with $\lambda = 5$.

If the normality assumption were true, t ratios based on samples of size 6 would
vary in accordance with the Student t distribution with 5 df. In Figure 7.4.6, $f_{T_5}(t)$
has been superimposed over the histograms of the t ratios coming from the three
different pdfs. What we see there is really quite remarkable. The t ratios based on
$y_i$’s coming from a uniform pdf, for example, are behaving much the same way as
t ratios would vary if the $y_i$’s were normally distributed; that is, $f_{T^*}(t)$ in this case
appears to be very similar to $f_{T_5}(t)$. The same is true for samples coming from a
Poisson distribution (see Theorem 4.2.2). For both of those underlying pdfs, in other
words, the true $\alpha$ would not be much different from the nominal $\alpha$.

Figure 7.4.6(b) tells a slightly different story. When samples of size 6 are drawn
from an exponential pdf, the t ratios are not in particularly close agreement with
$f_{T_5}(t)$. Specifically, very negative t ratios are occurring much more often than the
Student t curve would predict, while large positive t ratios are occurring less often
(see Question 7.4.23). But look at Figure 7.4.6(d). When the sample size is increased
to n = 15, the skewness so prominent in Figure 7.4.6(b) is mostly gone.


Figure 7.4.6 (a) Sample distribution of the t ratio (n = 6); vertical axis: probability density

MTB > random 100 c1-c6;
SUBC> uniform 0 1.
MTB > rmean c1-c6 c7
MTB > rstdev c1-c6 c8
MTB > let c9 = sqrt(6)*(((c7)-0.5)/(c8))
MTB > histogram c9

The "let" command calculates $\frac{\bar{y} - \mu}{s/\sqrt{n}} = \frac{\bar{y} - 0.5}{s/\sqrt{6}}$.


Figure 7.4.6 (Continued) (d) Sample distribution of the t ratio (n = 15); vertical axis: probability density

MTB > random 100 c1-c15;
SUBC> exponential 1.
MTB > rmean c1-c15 c16
MTB > rstdev c1-c15 c17
MTB > let c18 = sqrt(15)*(((c16) - 1.0)/(c17))
MTB > histogram c18


Reflected in these specific simulations are some general properties of the t
ratio:

1. The distribution of $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$ is relatively unaffected by the pdf of the $y_i$’s [provided
$f_Y(y)$ is not too skewed and n is not too small].

2. As n increases, the pdf of $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$ becomes increasingly similar to $f_{T_{n-1}}(t)$.

In mathematical statistics, the term robust is used to describe a procedure that is not
heavily dependent on whatever assumptions it makes. Figure 7.4.6 shows that the t
test is robust with respect to departures from normality.

From a practical standpoint, it would be difficult to overstate the importance
of the t test being robust. If the pdf of $\frac{\bar{Y} - \mu}{S/\sqrt{n}}$ varied dramatically depending on the
origin of the $y_i$’s, we would never know if the true $\alpha$ associated with, say, a 0.05
decision rule was anywhere near 0.05. That degree of uncertainty would make the t
test virtually worthless.
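The "true" $\alpha$ discussed in this section can be approximated by simulation. The sketch below is a Python analogue of the Minitab sessions, not a reproduction of them: it uses 20,000 replications instead of 100, a fixed seed chosen arbitrarily, and the standard table cutoffs $t_{.025,5} = 2.5706$ and $t_{.025,14} = 2.1448$. The helper names `t_ratio` and `tail_rates` are illustrative.

```python
import math
import random

def t_ratio(sample, mu):
    """t ratio (ybar - mu) / (s / sqrt(n)); assumes the sample is not constant."""
    n = len(sample)
    ybar = sum(sample) / n
    s = math.sqrt(sum((y - ybar) ** 2 for y in sample) / (n - 1))
    return (ybar - mu) / (s / math.sqrt(n))

def tail_rates(draw, mu, n, t_crit, reps, rng):
    """Fractions of simulated t ratios falling in the lower and upper rejection regions."""
    lower = upper = 0
    for _ in range(reps):
        t = t_ratio([draw(rng) for _ in range(n)], mu)
        if t <= -t_crit:
            lower += 1
        elif t >= t_crit:
            upper += 1
    return lower / reps, upper / reps

rng = random.Random(2012)  # arbitrary seed, only for repeatability
REPS = 20000
T_025_5 = 2.5706    # t_{.025,5}
T_025_14 = 2.1448   # t_{.025,14}

# label -> (sampler, true mean of the pdf, sample size, two-sided 0.05 cutoff)
cases = {
    "uniform[0,1], n=6": (lambda r: r.random(), 0.5, 6, T_025_5),
    "exponential(1), n=6": (lambda r: r.expovariate(1.0), 1.0, 6, T_025_5),
    "exponential(1), n=15": (lambda r: r.expovariate(1.0), 1.0, 15, T_025_14),
}

results = {}
for label, (draw, mu, n, t_crit) in cases.items():
    lower, upper = tail_rates(draw, mu, n, t_crit, REPS, rng)
    results[label] = (lower, upper)
    print(f"{label:22s} lower tail {lower:.4f}  upper tail {upper:.4f}  "
          f"estimated true alpha {lower + upper:.4f}")
```

Reporting the two tails separately makes the asymmetry visible: for the exponential samples the lower-tail rejections should dominate, which is the left skewness noted for Figure 7.4.6(b).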
Questions


7.4.23. Explain why the distribution of t ratios calculated
from small samples drawn from the exponential
pdf, $f_Y(y) = e^{-y}$, $y \geq 0$, will be skewed to the left [recall
Figure 7.4.6(b)]. [Hint: What does the shape of $f_Y(y)$ imply
about the possibility of each $y_i$ being close to 0? If the
entire sample did consist of $y_i$’s close to 0, what value
would the t ratio have?]


7.4.24. Suppose one hundred samples of size n = 3 are
taken from each of the pdfs

(1) $f_Y(y) = 2y$, $0 \leq y \leq 1$
and
(2) $f_Y(y) = 4y^3$, $0 \leq y \leq 1$

and for each set of three observations, the ratio

$$\frac{\bar{y} - \mu}{s/\sqrt{3}}$$

is calculated, where $\mu$ is the expected value of the particular
pdf being sampled. How would you expect the
distributions of the two sets of ratios to be different? How
would they be similar? Be as specific as possible.


7.4.25. Suppose that random samples of size n are drawn
from the uniform pdf, $f_Y(y) = 1$, $0 \leq y \leq 1$. For each sample,
the ratio $t = \frac{\bar{y} - 0.5}{s/\sqrt{n}}$ is calculated. Parts (b) and (d) of
Figure 7.4.6 suggest that the pdf of t will become increasingly
similar to $f_{T_{n-1}}(t)$ as n increases. To which pdf is
$f_{T_{n-1}}(t)$, itself, converging as n increases?


7.4.26. On which of the following sets of data would you
be reluctant to do a t test? Explain.

[Dot plots (a), (b), and (c) of three data sets along the y-axis]


7.5 Drawing Inferences About $\sigma^2$


When random samples are drawn from a normal distribution, it is usually the case
that the parameter $\mu$ is the target of the investigation. More often than not, the mean
mirrors the “effect” of a treatment or condition, in which case it makes sense to
apply what we learned in Section 7.4: either construct a confidence interval
for $\mu$ or test the hypothesis that $\mu = \mu_0$.

But exceptions are not that uncommon. Situations occur where the “precision”
associated with a measurement is, itself, important, perhaps even more important
than the measurement’s “location.” If so, we need to shift our focus to the scale
parameter, $\sigma^2$. Two key facts that we learned earlier about the population variance
will now come into play. First, an unbiased estimator for $\sigma^2$ based on its maximum
likelihood estimator is the sample variance, $S^2$, where

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$

And, second, the ratio

$$\frac{(n-1)S^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$

has a chi square distribution with n - 1 degrees of freedom. Putting these two pieces
of information together allows us to draw inferences about $\sigma^2$. In particular, we can
construct confidence intervals for $\sigma^2$ and test the hypothesis that $\sigma^2 = \sigma_0^2$.


Chi Square Tables 


Just as we need a t table to carry out inferences about $\mu$ (when $\sigma^2$ is unknown), we
need a chi square table to provide the cutoffs for making inferences involving $\sigma^2$. The
layout of chi square tables is dictated by the fact that all chi square pdfs (unlike Z
and t distributions) are skewed (see, for example, Figure 7.5.1, showing a chi square
curve having 5 degrees of freedom). Because of that asymmetry, chi square tables
need to provide cutoffs for both the left-hand tail and the right-hand tail of each chi
square distribution.


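Figure 7.5.1 Chi square pdf with 5 df, $f_{\chi^2_5}(y) = (3\sqrt{2\pi})^{-1} y^{3/2} e^{-y/2}$ (vertical axis: probability density). The area to the left of 1.145 is 0.05; the area to the right of 15.086 is 0.01.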


Figure 7.5.2 shows the top portion of the chi square table that appears in 
Appendix A.3. Successive rows refer to different chi square distributions (each hav- 
ing a different number of degrees of freedom). The column headings denote the 
areas to the left of the numbers listed in the body of the table. 


Figure 7.5.2

                                      p
df  .01       .025      .05      .10     .90     .95     .975    .99
1   0.000157  0.000982  0.00393  0.0158  2.706   3.841   5.024   6.635
2   0.0201    0.0506    0.103    0.211   4.605   5.991   7.378   9.210
3   0.115     0.216     0.352    0.584   6.251   7.815   9.348   11.345
4   0.297     0.484     0.711    1.064   7.779   9.488   11.143  13.277
5   0.554     0.831     1.145    1.610   9.236   11.070  12.832  15.086
6   0.872     1.237     1.635    2.204   10.645  12.592  14.449  16.812
7   1.239     1.690     2.167    2.833   12.017  14.067  16.013  18.475
8   1.646     2.180     2.733    3.490   13.362  15.507  17.535  20.090
9   2.088     2.700     3.325    4.168   14.684  16.919  19.023  21.666
10  2.558     3.247     3.940    4.865   15.987  18.307  20.483  23.209
11  3.053     3.816     4.575    5.578   17.275  19.675  21.920  24.725
12  3.571     4.404     5.226    6.304   18.549  21.026  23.336  26.217


We will use the symbol $\chi^2_{p,n}$ to denote the number along the horizontal axis that
cuts off, to its left, an area of p under the chi square distribution with n degrees of
freedom. For example, from the fifth row of the chi square table, we see the numbers
1.145 and 15.086 under the column headings .05 and .99, respectively. It follows
that

$P(\chi^2_5 \leq 1.145) = 0.05$
and
$P(\chi^2_5 \leq 15.086) = 0.99$

(see Figure 7.5.1). In terms of the $\chi^2_{p,n}$ notation, $1.145 = \chi^2_{.05,5}$ and $15.086 = \chi^2_{.99,5}$.
(The area to the right of 15.086, of course, must be 0.01.)
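The two table entries just quoted can be checked numerically by integrating the chi square pdf. The following Python sketch uses composite Simpson's rule with the standard library only; the function names and the step count are illustrative assumptions, and the integration as written is intended for df of 2 or more (for df = 1 the pdf is unbounded at 0).

```python
import math

def chi_square_pdf(y, df):
    """Chi square pdf with df degrees of freedom, evaluated at y >= 0."""
    return (y ** (df / 2 - 1) * math.exp(-y / 2)
            / (2 ** (df / 2) * math.gamma(df / 2)))

def chi_square_cdf(x, df, steps=2000):
    """Approximate P(chi-square_df <= x) by composite Simpson's rule on [0, x].

    `steps` must be even; assumes df >= 2 so the pdf is finite at 0.
    """
    h = x / steps
    total = chi_square_pdf(0.0, df) + chi_square_pdf(x, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * chi_square_pdf(i * h, df)
    return total * h / 3

left_area = chi_square_cdf(1.145, 5)    # should be close to 0.05
big_area = chi_square_cdf(15.086, 5)    # should be close to 0.99
print(f"P(chi2_5 <= 1.145)  is approximately {left_area:.4f}")
print(f"P(chi2_5 <= 15.086) is approximately {big_area:.4f}")
```

For df = 5 the general pdf in `chi_square_pdf` reduces to the expression shown with Figure 7.5.1, since $2^{5/2}\,\Gamma(5/2) = 3\sqrt{2\pi}$.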
Constructing Confidence Intervals for $\sigma^2$


Since $\frac{(n-1)S^2}{\sigma^2}$ has a chi square distribution with n - 1 degrees of freedom, we can
write

$$P\left(\chi^2_{\alpha/2,n-1} \leq \frac{(n-1)S^2}{\sigma^2} \leq \chi^2_{1-\alpha/2,n-1}\right) = 1 - \alpha \qquad (7.5.1)$$

If Equation 7.5.1 is then inverted to isolate $\sigma^2$ in the center of the inequalities,
the two endpoints will necessarily define a $100(1 - \alpha)\%$ confidence interval for the
population variance. The algebraic details will be left as an exercise.


Theorem 7.5.1 Let $s^2$ denote the sample variance calculated from a random sample of n observations
drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$. Then

a. a $100(1 - \alpha)\%$ confidence interval for $\sigma^2$ is the set of values

$$\left(\frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}},\ \frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}}\right)$$

b. a $100(1 - \alpha)\%$ confidence interval for $\sigma$ is the set of values

$$\left(\sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}}},\ \sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}}}\right)$$


Case Study 7.5.1 


The chain of events that define the geological evolution of the Earth began 
hundreds of millions of years ago. Fossils play a key role in documenting the 
relative times those events occurred, but to establish an absolute chronology, 
scientists rely primarily on radioactive decay. 

One of the newest dating techniques uses a rock’s potassium-argon ratio.
Almost all minerals contain potassium (K) as well as certain of its isotopes,
including $^{40}$K. The latter, though, is unstable and decays into isotopes of argon
and calcium, $^{40}$Ar and $^{40}$Ca. By knowing the rates at which the various daughter
products are formed and by measuring the amounts of $^{40}$Ar and $^{40}$K present in
a specimen, geologists can estimate the object’s age.

Critical to the interpretation of any such dates, of course, is the precision 
of the underlying procedure. One obvious way to estimate that precision is to 
use the technique on a sample of rocks known to have the same age. Whatever 
variation occurs, then, from rock to rock is reflecting the inherent precision (or 
lack of precision) of the procedure. 

Table 7.5.1 lists the potassium-argon estimated ages of nineteen mineral
samples, all taken from the Black Forest in southwestern Germany (111).
Assume that the procedure’s estimated ages are normally distributed with
(unknown) mean $\mu$ and (unknown) variance $\sigma^2$. Construct a 95% confidence
interval for $\sigma$.


Table 7.5.1
Specimen  Estimated Age (millions of years)
1 249 
2 254 
3 243 
4 268 
5 253 
6 269 
7 287 
8 241 
9 273 
10 306 
11 303 
12 280 
13 260 
14 256 
15 278 
16 344 
17 304 
18 283 
19 310 
Here

$$\sum_{i=1}^{19} y_i = 5261 \qquad \sum_{i=1}^{19} y_i^2 = 1,469,945$$

so the sample variance is 733.4:

$$s^2 = \frac{19(1,469,945) - (5261)^2}{19(18)} = 733.4$$

Since n = 19, the critical values appearing in the left-hand and right-hand limits
of the $\sigma$ confidence interval come from the chi square pdf with 18 df. According
to Appendix Table A.3,

$$P(8.23 < \chi^2_{18} < 31.53) = 0.95$$

so the 95% confidence interval for the potassium-argon method’s precision is
the set of values

$$\left(\sqrt{\frac{(19-1)(733.4)}{31.53}},\ \sqrt{\frac{(19-1)(733.4)}{8.23}}\right) = (20.5 \text{ million years},\ 40.0 \text{ million years})$$
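The interval in Case Study 7.5.1 follows directly from Theorem 7.5.1 once the two chi square cutoffs are in hand. Below is a short Python sketch of that calculation; the cutoffs 8.23 and 31.53 are taken from the case study rather than computed, and the function name `variance_ci` is an illustrative choice.

```python
import math
import statistics

# Estimated ages (millions of years) of the nineteen specimens in Table 7.5.1.
ages = [249, 254, 243, 268, 253, 269, 287, 241, 273, 306,
        303, 280, 260, 256, 278, 344, 304, 283, 310]

def variance_ci(s2, n, chi2_lower, chi2_upper):
    """Confidence interval for sigma^2 from part (a) of Theorem 7.5.1.

    chi2_lower and chi2_upper are the alpha/2 and 1 - alpha/2 cutoffs
    of the chi square distribution with n - 1 df.
    """
    return (n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower

n = len(ages)
s2 = statistics.variance(ages)  # sample variance (divisor n - 1)

CHI2_025_18 = 8.23    # chi-square_{.025,18}, as quoted in the case study
CHI2_975_18 = 31.53   # chi-square_{.975,18}, as quoted in the case study

var_lo, var_hi = variance_ci(s2, n, CHI2_025_18, CHI2_975_18)
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)  # part (b): interval for sigma
print(f"s^2 = {s2:.1f}")
print(f"95% CI for sigma^2: ({var_lo:.1f}, {var_hi:.1f})")
print(f"95% CI for sigma:   ({sd_lo:.1f}, {sd_hi:.1f}) million years")
```

Note that the larger cutoff produces the lower endpoint and the smaller cutoff the upper endpoint, a consequence of inverting the inequalities in Equation 7.5.1.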
31.53 8.23 
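Example 7.5.1 The width of a confidence interval for $\sigma^2$ is a function of both n and $S^2$:

$$\begin{aligned}
\text{Width} &= \text{upper limit} - \text{lower limit} \\
&= \frac{(n-1)S^2}{\chi^2_{\alpha/2,n-1}} - \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,n-1}} \\
&= (n-1)S^2\left(\frac{1}{\chi^2_{\alpha/2,n-1}} - \frac{1}{\chi^2_{1-\alpha/2,n-1}}\right) \qquad (7.5.2)
\end{aligned}$$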


As n gets larger, the interval will tend to get narrower because the unknown $\sigma^2$ is
being estimated more precisely. What is the smallest number of observations that
will guarantee that the average width of a 95% confidence interval for $\sigma^2$ is no
greater than $\sigma^2$?

Since $S^2$ is an unbiased estimator for $\sigma^2$, Equation 7.5.2 implies that the
expected width of a 95% confidence interval for the variance is the expression

$$E(\text{width}) = (n-1)\sigma^2\left(\frac{1}{\chi^2_{.025,n-1}} - \frac{1}{\chi^2_{.975,n-1}}\right)$$

Clearly, then, for the expected width to be less than or equal to $\sigma^2$, n must be chosen
so that

$$(n-1)\left(\frac{1}{\chi^2_{.025,n-1}} - \frac{1}{\chi^2_{.975,n-1}}\right) \leq 1$$


Trial and error can be used to identify the desired n. The first three columns in
Table 7.5.2 come from the chi square distribution in Appendix Table A.3. As the
computation in the last column indicates, n = 39 is the smallest sample size that will
yield 95% confidence intervals for $\sigma^2$ whose average width is less than $\sigma^2$.


Table 7.5.2
n   $\chi^2_{.025,n-1}$  $\chi^2_{.975,n-1}$  $(n-1)\left(\frac{1}{\chi^2_{.025,n-1}} - \frac{1}{\chi^2_{.975,n-1}}\right)$
15  5.629                26.119               1.95
20  8.907                32.852               1.55
30  16.047               45.722               1.17
38  22.106               55.668               1.01
39  22.878               56.895               0.99
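The last column of Table 7.5.2 is simple arithmetic on the two cutoffs in each row. The Python sketch below recomputes it from the tabulated values; it only examines the five sample sizes listed in the table, so it confirms where the factor crosses 1 among those rows rather than searching every n. The name `expected_width_factor` is an illustrative choice.

```python
# (n, chi-square_{.025,n-1}, chi-square_{.975,n-1}) as listed in Table 7.5.2.
rows = [
    (15, 5.629, 26.119),
    (20, 8.907, 32.852),
    (30, 16.047, 45.722),
    (38, 22.106, 55.668),
    (39, 22.878, 56.895),
]

def expected_width_factor(n, chi2_lower, chi2_upper):
    """E(width) / sigma^2 for a 95% confidence interval for sigma^2."""
    return (n - 1) * (1 / chi2_lower - 1 / chi2_upper)

factors = {n: expected_width_factor(n, lo, hi) for n, lo, hi in rows}
for n, factor in factors.items():
    print(f"n = {n:2d}: expected width = {factor:.2f} * sigma^2")

# Smallest tabulated n whose expected width does not exceed sigma^2.
smallest_tabulated = min(n for n, factor in factors.items() if factor <= 1)
print("smallest tabulated n with factor <= 1:", smallest_tabulated)
```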


Testing $H_0: \sigma^2 = \sigma_0^2$


The generalized likelihood ratio criterion introduced in Section 6.5 can be used to
set up hypothesis tests for $\sigma^2$. The complete derivation appears in Appendix 7.A.4.
Theorem 7.5.2 states the resulting decision rule. Playing a key role, just as it did
in the construction of confidence intervals for $\sigma^2$, is the chi square ratio from
Theorem 7.3.2.

Theorem 7.5.2 Let $s^2$ denote the sample variance calculated from a random sample of n observations
drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$. Let
$\chi^2 = (n-1)s^2/\sigma_0^2$.

a. To test $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 > \sigma_0^2$ at the $\alpha$ level of significance, reject $H_0$ if
$\chi^2 \geq \chi^2_{1-\alpha,n-1}$.
b. To test $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 < \sigma_0^2$ at the $\alpha$ level of significance, reject $H_0$ if
$\chi^2 \leq \chi^2_{\alpha,n-1}$.
c. To test $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 \neq \sigma_0^2$ at the $\alpha$ level of significance, reject $H_0$ if $\chi^2$
is either (1) $\leq \chi^2_{\alpha/2,n-1}$ or (2) $\geq \chi^2_{1-\alpha/2,n-1}$.


Case Study 7.5.2 


Mutual funds are investment vehicles consisting of a portfolio of various types 
of investments. If such an investment is to meet annual spending needs, the 
owner of shares in the fund is interested in the average of the annual returns of 
the fund. Investors are also concerned with the volatility of the annual returns, 
measured by the variance or standard deviation. One common method of evalu- 
ating a mutual fund is to compare it to a benchmark, the Lipper Average being 
one of these. This index number is the average of returns from a universe of 
mutual funds. 

The Global Rock Fund is a typical mutual fund, with heavy investments in 
international funds. It claimed to best the Lipper Average in terms of volatility 
over the period from 1989 through 2007. Its returns are given in the table below. 


Year  Investment Return %   Year  Investment Return %
1989  15.32                 1999  27.43
1990  1.62                  2000  8.57
1991  28.43                 2001  1.88
1992  11.91                 2002  -7.96
1993  20.71                 2003  35.98
1994  -2.15                 2004  14.27
1995  23.29                 2005  10.33
1996  15.96                 2006  15.94
1997  11.12                 2007  16.71
1998  0.37


The standard deviation for these returns is 11.28%, while the correspond- 
ing figure for the Lipper Average is 11.67%. Now, clearly, the Global Rock Fund 
has a smaller standard deviation than the Lipper Average, but is this small dif- 
ference due just to random variation? The hypothesis test is meant to answer 
such questions. 

Let $\sigma^2$ denote the variance of the population represented by the return
percentages shown in the table above. To judge whether the observed standard
deviation less than 11.67 is significant requires that we test

$H_0: \sigma^2 = (11.67)^2$
versus
$H_1: \sigma^2 < (11.67)^2$

Let $\alpha = 0.05$. With n = 19, the critical value for the chi square ratio [from
part (b) of Theorem 7.5.2] is $\chi^2_{\alpha,n-1} = \chi^2_{.05,18} = 9.390$ (see Figure 7.5.3). But

$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{(19-1)(11.28)^2}{(11.67)^2} = 16.82$$

so our decision is clear: Do not reject $H_0$.

Figure 7.5.3
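The chi square ratio in Case Study 7.5.2 can be recomputed from the nineteen annual returns. The Python sketch below does so under the same setup as the case study; the cutoff 9.390 is copied from the text rather than computed, and the variable names are illustrative.

```python
import math
import statistics

# Annual returns (%) of the fund, 1989 through 2007, from the case study table.
returns = [
    15.32, 1.62, 28.43, 11.91, 20.71, -2.15, 23.29, 15.96, 11.12, 0.37,
    27.43, 8.57, 1.88, -7.96, 35.98, 14.27, 10.33, 15.94, 16.71,
]

SIGMA0 = 11.67        # benchmark (Lipper Average) standard deviation under H0
CHI2_05_18 = 9.390    # chi-square_{.05,18}, as quoted in the case study

n = len(returns)
s2 = statistics.variance(returns)     # sample variance (divisor n - 1)
chi2 = (n - 1) * s2 / SIGMA0 ** 2     # test statistic from Theorem 7.5.2

# One-sided (lower-tail) rule, part (b) of Theorem 7.5.2.
reject = chi2 <= CHI2_05_18
print(f"s = {math.sqrt(s2):.2f}, chi-square ratio = {chi2:.2f}, reject H0: {reject}")
```

Since the alternative is $\sigma^2 < \sigma_0^2$, only small values of the ratio count as evidence against $H_0$; a ratio well above the lower cutoff leaves $H_0$ standing.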


Questions 


7.5.1. Use Appendix Table A.3 to find the following 
cutoffs and indicate their location on the graph of the 
appropriate chi square distribution. 


(a) Xs, 14 
(b) $\chi^2_{.90,2}$
(c) $\chi^2_{.025,9}$


7.5.2. Evaluate the following probabilities: 


(a) $P(\chi^2_{17} \geq 8.672)$
(b) $P(\chi^2_6 < 10.645)$
(c) $P(9.591 \leq \chi^2_{20} \leq 34.170)$
(d) $P(\chi^2_2 < 9.210)$


7.5.3. Find the value y that satisfies each of the following 
equations: 


(a) Py > y) = 0.99 

(b) P(x?s<y) =0.05 

(c) $P(9.542 < \chi^2_{22} < y) = 0.09$
(d) $P(y < \chi^2_{31} < 48.232) = 0.95$


7.5.4. For what value of n is each of the following state- 
ments true? 


(a) $P(\chi^2_n \geq 5.009) = 0.975$
(b) $P(27.204 \leq \chi^2_n \leq 30.144) = 0.05$
(c) $P(\chi^2_n \leq 19.281) = 0.05$
(d) $P(10.085 \leq \chi^2_n \leq 24.769) = 0.80$


7.5.5. For df values beyond the range of Appendix
Table A.3, chi square cutoffs can be approximated by
using a formula based on cutoffs from the standard normal
pdf, $f_Z(z)$. Define $\chi^2_{p,n}$ and $z^*_p$ so that $P(\chi^2_n \leq \chi^2_{p,n}) = p$
and $P(Z \leq z^*_p) = p$, respectively. Then

$$\chi^2_{p,n} \doteq n\left(1 - \frac{2}{9n} + z^*_p\sqrt{\frac{2}{9n}}\right)^3$$

Approximate the 95th percentile of the chi square distribution
with 200 df. That is, find the value of y for
which

$$P(\chi^2_{200} \leq y) \doteq 0.95$$


7.5.6. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size n from
a normal distribution having mean $\mu$ and variance $\sigma^2$.
What is the smallest value of n for which the following is
true?

$$P\left(\frac{S^2}{\sigma^2} < 2\right) \geq 0.95$$

(Hint: Use a trial-and-error method.)


7.5.7. Start with the fact that $(n-1)S^2/\sigma^2$ has a chi
square distribution with n - 1 df (if the $Y_i$’s are normally
distributed) and derive the confidence interval formulas
given in Theorem 7.5.1.


7.5.8. A random sample of size n = 19 is drawn from a normal
distribution for which $\sigma^2 = 12.0$. In what range are we
likely to find the sample variance, $s^2$? Answer the question
by finding two numbers a and b such that

$$P(a \leq S^2 \leq b) = 0.95$$


7.5.9. How long sporting events last is quite variable. 
This variability can cause problems for TV broadcast- 
ers, since the amount of commercials and commentator 
blather varies with the length of the event. As an exam- 
ple of this variability, the table below gives the lengths 
for a random sample of middle-round contests at the 2008 
Wimbledon Championships in women’s tennis. 


Match Length (minutes) 
Cirstea-Kuznetsova 73 
Srebotnik-Meusburger 76 
De Los Rios-V. Williams 59 
Kanepi-Mauresmo 104 
Garbin-Szavay 114 
Bondarenko-Lisicki 106 
Vaidisova-Bremond 79 
Groenefeld-Moore 74 
Govortsova-Sugiyama 142 
Zheng-Jankovic 129 
Perebiynis-Bammer 95
Bondarenko-V. Williams 56 
Coin-Mauresmo 84 
Petrova-Pennetta 142 
Wozniacki-Jankovic 106 


Groenefeld-Safina 75 


Source: 2008.usopen.org/en_US/scores/cmatch/index.html?promo=t. 


(a) Assume that match lengths are normally distributed.
Use Theorem 7.5.1 to construct a 95% confidence
interval for the standard deviation of match
lengths.

(b) Use these same data to construct two one-sided 95%
confidence intervals for $\sigma$.


7.5.10. How much interest certificates of deposit (CDs)
pay varies by financial institution and also by length of
the investment. A large sample of national one-year CD
offerings in 2009 showed an average interest rate of 1.84
and a standard deviation $\sigma = 0.262$. A five-year CD ties
up an investor’s money, so it usually pays a higher rate
of interest. However, higher rates might cause more variability.
The table lists the five-year CD rate offerings from
n = 10 banks in the northeast United States. Find a 95%
confidence interval for the standard deviation of five-year
CD rates. Do these data suggest that interest rates for
five-year CDs are more variable than those for one-year
certificates?


Bank Interest Rate (%) 
Domestic Bank 2.21 
Stonebridge Bank 2.47 
Waterfield Bank 2.81 
NOVA Bank 2.81 
American Bank 2.96 
Metropolitan National Bank 3.00 
AIG Bank 3.35 
iGObanking.com 3.44 
Discover Bank 3.44 
Intervest National Bank 3.49 


Source: Company reports. 


7.5.11. In Case Study 7.5.1, the 95% confidence inter- 
val was constructed for o rather than for o”. In practice, 
is an experimenter more likely to focus on the standard 
deviation or on the variance, or do you think that both 
formulas in Theorem 7.5.1 are likely to be used equally 
often? Explain. 


7.5.12. (a) Use the asymptotic normality of chi square
random variables (see Question 7.3.6) to derive
large-sample confidence interval formulas for $\sigma$
and $\sigma^2$.

(b) Use your answer to part (a) to construct an approximate
95% confidence interval for the standard deviation
of estimated potassium-argon ages based on
the 19 $y_i$'s in Table 7.5.1. How does this confidence
interval compare with the one in Case Study 7.5.1?


7.5.13. If a 90% confidence interval for $\sigma^2$ is reported to
be (51.47, 261.90), what is the value of the sample standard 
deviation? 


7.5.14. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size n
from the pdf

\[ f_Y(y) = \left(\frac{1}{\theta}\right) e^{-y/\theta}, \quad y > 0; \ \theta > 0 \]

(a) Use moment-generating functions to show that the
ratio $2n\bar{Y}/\theta$ has a chi square distribution with 2n df.

(b) Use the result in part (a) to derive a $100(1 - \alpha)\%$
confidence interval for $\theta$.


7.5.15. Another method for dating rocks was used before 
the advent of the potassium-argon method described in 
Case Study 7.5.1. Because of a mineral’s lead content, 
it was capable of yielding estimates for this same time 
period with a standard deviation of 30.4 million years. The 
potassium-argon method in Case Study 7.5.1 had a smaller 
sample standard deviation of $\sqrt{733.4} = 27.1$ million years.
Is this “proof” that the potassium-argon method is more 
precise? Using the data in Table 7.5.1, test at the 
0.05 level whether the potassium-argon method has a 
smaller standard deviation than the older procedure using 
lead. 


7.5.16. When working properly, the amounts of cement
that a filling machine puts into 25-kg bags have a standard
deviation ($\sigma$) of 1.0 kg. In the next column are the
weights recorded for thirty bags selected at random from
a day's production. Test $H_0: \sigma^2 = 1$ versus $H_1: \sigma^2 > 1$ using
the $\alpha = 0.05$ level of significance. Assume that the weights
are normally distributed.


26.18 24.22 24.22 
25.30 26.48 24.49 
25.18 23.97 25.68 
24.54 25.83 26.01 
25.14 25.05 25.50 
25.44 26.24 25.84 
24.49 25.46 26.09 
25.01 25.01 25.21 
25.12 24.71 26.04 
25.67 25.27 25.23 


Use the following sums:

\[ \sum_{i=1}^{30} y_i = 758.62 \quad \text{and} \quad \sum_{i=1}^{30} y_i^2 = 19,195.7938 \]


7.5.17. A stock analyst claims to have devised a mathe- 
matical technique for selecting high-quality mutual funds 
and promises that a client’s portfolio will have higher aver- 
age ten-year annualized returns and lower volatility; that 
is, a smaller standard deviation. After ten years, one of the 
analyst’s twenty-four-stock portfolios showed an average 
ten-year annualized return of 11.50% and a standard devi- 
ation of 10.17%. The benchmarks for the type of funds 
considered are a mean of 10.10% and a standard deviation 
of 15.67%. 


(a) Let $\mu$ be the mean for a twenty-four-stock portfolio
selected by the analyst's method. Test at the 0.05
level that the portfolio beat the benchmark; that is,
test $H_0: \mu = 10.1$ versus $H_1: \mu > 10.1$.

(b) Let $\sigma$ be the standard deviation for a twenty-four-stock
portfolio selected by the analyst's method.
Test at the 0.05 level that the portfolio beat the
benchmark; that is, test $H_0: \sigma = 15.67$ versus
$H_1: \sigma < 15.67$.


7.6 Taking a Second Look at Statistics (Type II Error) 


For data that are normal, and when the variance $\sigma^2$ is known, both Type I errors and
Type II errors can be determined, staying within the family of normal distributions.
(See Example 6.4.1, for instance.) As the material in this chapter shows, the situation
changes radically when $\sigma^2$ is not known. With the development of the Student t
distribution, tests of a given level of significance $\alpha$ can be constructed. But what is
the Type II error of such a test?

To answer this question, let us first recall the form of the test statistic and critical
region testing, for example,

\[ H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu > \mu_0 \]


Example 
7.6.1 
The null hypothesis is rejected if

\[ \frac{\bar{y} - \mu_0}{s/\sqrt{n}} \ge t_{\alpha,n-1} \]

The probability of the Type II error, $\beta$, of the test at some value $\mu_1 > \mu_0$ is

\[ P\left( \frac{\bar{Y} - \mu_0}{S/\sqrt{n}} < t_{\alpha,n-1} \right) \]

However, since $\mu_0$ is not the mean of $\bar{Y}$ under $H_1$, the distribution of
$\frac{\bar{Y} - \mu_0}{S/\sqrt{n}}$ is not Student t. Indeed, a new distribution is called for.


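The following algebraic manipulations help to place the needed density into a
recognizable form.

\[
\frac{\bar{Y}-\mu_0}{S/\sqrt{n}}
= \frac{\bar{Y}-\mu_1+(\mu_1-\mu_0)}{S/\sqrt{n}}
= \frac{\dfrac{\bar{Y}-\mu_1}{\sigma/\sqrt{n}}+\dfrac{\mu_1-\mu_0}{\sigma/\sqrt{n}}}{S/\sigma}
= \frac{\dfrac{\bar{Y}-\mu_1}{\sigma/\sqrt{n}}+\dfrac{\mu_1-\mu_0}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{(n-1)S^2/\sigma^2}{n-1}}}
= \frac{Z+\delta}{\sqrt{\dfrac{U}{n-1}}}
\]

where $Z = \frac{\bar{Y}-\mu_1}{\sigma/\sqrt{n}}$ is normal, $U = \frac{(n-1)S^2}{\sigma^2}$ is a chi square variable with n - 1 degrees of
freedom, and $\delta = \frac{\mu_1-\mu_0}{\sigma/\sqrt{n}}$ is an (unknown) constant. Note that the random variable
$\frac{Z+\delta}{\sqrt{U/(n-1)}}$ differs from the Student t with n - 1 degrees of freedom, $\frac{Z}{\sqrt{U/(n-1)}}$, only because of
the additive term $\delta$ in the numerator. But adding $\delta$ changes the nature of the pdf
significantly.

An expression of the form $\frac{Z+\delta}{\sqrt{U/(n-1)}}$ is said to have a noncentral t distribution with
n - 1 degrees of freedom and noncentrality parameter $\delta$.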

The probability density function for a noncentral t variable is now well known
(97). Even though there are computer approximations to the distribution, not knowing
$\sigma^2$ means that $\delta$ is also unknown. One approach often taken is to specify the
difference between the true mean and the hypothesized mean as a given proportion
of $\sigma$. That is, the Type II error is given as a function of $\frac{\mu_1-\mu_0}{\sigma}$ rather than $\mu_1$. In some
cases, this quantity can be approximated by $\frac{\mu_1-\mu_0}{s}$.


The following numerical example will help to clarify these ideas. 


Suppose we wish to test $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$ at the $\alpha = 0.05$ level of
significance. Let n = 20. In this case the test is to reject $H_0$ if the test statistic $\frac{\bar{y}-\mu_0}{s/\sqrt{n}}$ is
greater than $t_{.05,19} = 1.7291$. What will be the Type II error if the mean has shifted
by 0.5 standard deviation to the right of $\mu_0$?

Saying that the mean has shifted by 0.5 standard deviation to the right of $\mu_0$
is equivalent to setting $\frac{\mu_1-\mu_0}{\sigma} = 0.5$. In that case, the noncentrality parameter is
$\delta = \frac{\mu_1-\mu_0}{\sigma/\sqrt{n}} = (0.5)\sqrt{20} = 2.236$.


The probability of a Type II error is

\[ P(T_{19,2.236} \le 1.7291) \]

where $T_{19,2.236}$ is a noncentral t variable with 19 degrees of freedom and
noncentrality parameter 2.236.

To calculate this quantity, we need the cdf of $T_{19,2.236}$. Fortunately, many
statistical software programs have this function. The Minitab commands for calculating
the desired probability are


MTB > CDF 1.7291;
SUBC > T 19 2.236.


with output 
Cumulative Distribution Function 
Student’s t distribution with 19 DF and noncentrality parameter 2.236 


x P(X <= x) 
1.7291 0.304828 


The sought-after Type II error to three decimal places is 0.305.
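
For readers without Minitab, the same probability can be approximated from the representation $(Z+\delta)/\sqrt{U/(n-1)}$ given earlier in this section. The sketch below is not the algorithm Minitab uses; it simply conditions on U and integrates $\Phi(t\sqrt{u/(n-1)} - \delta)$ against the chi square pdf with Simpson's rule. The function names, the integration limit, and the step count are illustrative choices.

```python
import math

def std_normal_cdf(x):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def chi_square_pdf(u, df):
    """Chi square pdf with df degrees of freedom; returns 0 for u <= 0 (assumes df > 2)."""
    if u <= 0.0:
        return 0.0
    half = df / 2.0
    log_pdf = (half - 1.0) * math.log(u) - u / 2.0 - half * math.log(2.0) - math.lgamma(half)
    return math.exp(log_pdf)

def noncentral_t_cdf(t, df, delta, steps=4000):
    """Approximate P((Z + delta) / sqrt(U/df) <= t) by Simpson's rule over the values u of U."""
    upper = df + 20.0 * math.sqrt(2.0 * df) + 50.0  # assumed cutoff; the chi square tail beyond it is negligible

    def integrand(u):
        return std_normal_cdf(t * math.sqrt(u / df) - delta) * chi_square_pdf(u, df)

    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0

delta = 0.5 * math.sqrt(20)                 # noncentrality parameter of Example 7.6.1
beta = noncentral_t_cdf(1.7291, 19, delta)  # Type II error probability
print(round(delta, 3), round(beta, 3))
```

Setting delta to 0 reduces the calculation to the ordinary Student t cdf, which gives a quick sanity check: the value at 1.7291 with 19 degrees of freedom should be close to 0.95.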


Simulations 


As we have seen, with enough distribution theory, the tools for finding Type II errors
for the Student t test exist. Also, there are noncentral chi square and F distributions.

However, the assumption that the underlying data are normally distributed is 
necessary for such results. In the case of Type I errors, we have seen that the t test is 
somewhat robust with regard to the data deviating from normality. (See Section 7.4.) 
In the case of the noncentral t, dealing with departures from normality presents 
significant analytical challenges. But the empirical approach of using simulations 
can bypass such difficulties and still give meaningful results. 

To start, consider a simulation of the problem presented in Example 7.6.1. Suppose
the data have a normal distribution with $\mu_0 = 5$ and $\sigma = 3$. The sample size is
n = 20. Suppose we want to find the Type II error when the true $\delta = 2.236$. For the
given $\sigma = 3$, this is equivalent to

\[ 2.236 = \frac{\mu_1-\mu_0}{\sigma/\sqrt{n}} = \frac{\mu_1-5}{3/\sqrt{20}} \]

or $\mu_1 = 6.5$.

A Type II error occurs if the test statistic is less than 1.7291. In this case, $H_0$
would be accepted when rejection is the proper decision.

Using Minitab, two hundred samples of size 20 from the normal distribution
with $\mu = 6.5$ and $\sigma^2 = 9$ are generated: Minitab produces a 200 x 20 array. For each
row of the array, the test statistic $\frac{\bar{y}-5}{s/\sqrt{20}}$ is calculated and placed in Column 21. If this
value is less than 1.7291, a 1 is placed in that row of Column 22; otherwise a 0 goes
there. The sum of the entries in Column 22 gives the observed number of Type II
errors. Based on the computed value of the Type II error, 0.305, for the assumed
value of $\delta$, this observed number should be approximately 200(0.305) = 61.

The Minitab simulation gave sixty-four observed Type II errors—a very close 
figure to what was expected. 
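
The same experiment is easy to imitate outside Minitab. The following is a minimal sketch using Python's standard random module; the function name, the seed, and the use of a list in place of Minitab's columns are assumptions made for illustration. Because the random numbers differ from Minitab's, the count will generally not be exactly sixty-four, but it should fall near 61.

```python
import math
import random

def count_type_ii_errors(draw, reps, n=20, mu0=5.0, cutoff=1.7291):
    """Count the samples whose t ratio is below the cutoff, so that H0 is not rejected."""
    errors = 0
    for _ in range(reps):
        sample = [draw() for _ in range(n)]
        y_bar = sum(sample) / n
        s = math.sqrt(sum((y - y_bar) ** 2 for y in sample) / (n - 1))
        t_ratio = (y_bar - mu0) / (s / math.sqrt(n))
        if t_ratio < cutoff:
            errors += 1
    return errors

rng = random.Random(2024)  # arbitrary seed, fixed only to make the run repeatable
errors = count_type_ii_errors(lambda: rng.gauss(6.5, 3.0), reps=200)
print(errors, "Type II errors in 200 samples")
```

Raising reps to several thousand shrinks the sampling error, and the observed proportion should then settle close to the computed value 0.305.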

The robustness for Type II errors can lead to analytical thickets. However, sim- 
ulation can again shed some light on Type II errors in some cases. As an example, 
suppose the data are not normal, but gamma with r = 4.694 and $\lambda = 0.722$. Even
though the distribution is skewed, these values make the mean $\mu = 6.5$ and the
variance $\sigma^2 = 9$, as in the normal case above. Again relying on Minitab to give two
hundred random samples of size 20, the observed number of Type II errors is sixty,
so the test has some robustness for Type II errors in that case. Even though the data
are not normal, the key statistic in the analysis, $\bar{y}$, will be approximately normal by
the central limit theorem.
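
A corresponding sketch for the gamma case follows. It first confirms that r = 4.694 and $\lambda = 0.722$ give a mean of about 6.5 and a variance of about 9, and then repeats the counting experiment. One assumption to note: Python's random.gammavariate takes a shape and a scale, so the scale is set to $1/\lambda$. The seed and helper name are again illustrative, and the count will not match Minitab's sixty exactly.

```python
import math
import random

r, lam = 4.694, 0.722
gamma_mean = r / lam          # mean of a gamma pdf with parameters r and lambda
gamma_var = r / lam ** 2      # variance of the same pdf

def type_ii_count(draw, reps, n=20, mu0=5.0, cutoff=1.7291):
    """Number of simulated samples for which the t ratio stays below the cutoff."""
    count = 0
    for _ in range(reps):
        sample = [draw() for _ in range(n)]
        y_bar = sum(sample) / n
        s = math.sqrt(sum((y - y_bar) ** 2 for y in sample) / (n - 1))
        if (y_bar - mu0) / (s / math.sqrt(n)) < cutoff:
            count += 1
    return count

rng = random.Random(2024)  # arbitrary seed
gamma_errors = type_ii_count(lambda: rng.gammavariate(r, 1.0 / lam), reps=200)
print(round(gamma_mean, 2), round(gamma_var, 2), gamma_errors)
```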

If the distribution of the underlying data is unknown or extremely skewed, 
nonparametric tests, like the ones covered in Chapter 14 and in (28) are advised. 


Appendix 7.A.1 Minitab Applications


Figure 7.A.1.1 


Many statistical procedures, including several featured in this chapter, require
that the sample mean and sample standard deviation be calculated. Minitab's
DESCRIBE command gives $\bar{y}$ and s, along with several other useful numerical
characteristics of a sample. Figure 7.A.1.1 shows the DESCRIBE input and output for
the twenty observations cited in Example 7.4.1.


MTB > set c1
DATA > 2.5 3.2 0.5 0.4 0.3 0.1 0.1 0.2 7.4 8.6 0.2 0.1
DATA > 0.4 1.8 0.3 1.3 1.4 11.2 2.1 10.1
DATA > end
MTB > describe c1


Descriptive Statistics: C1


Variable   N  N*   Mean  SE Mean  StDev  Minimum     Q1  Median     Q3  Maximum
C1        20   0  2.610    0.809  3.617    0.100  0.225   0.900  3.025   11.200

Here,
N = sample size
N* = number of observations missing from c1 (that is, the
number of "interior" blanks)
Mean = sample mean = $\bar{y}$
SE Mean = standard error of the mean = $s/\sqrt{n}$
StDev = sample standard deviation = s
Minimum = smallest observation
Q1 = first quartile = 25th percentile
Median = middle observation (in terms of magnitude), or
average of the middle two if n is even
Q3 = third quartile = 75th percentile
Maximum = largest observation
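
As a cross-check on the DESCRIBE output, the same summaries can be computed with Python's standard statistics module. This is only a sketch; in particular, the choice method="exclusive" for the quartiles is an assumption, made because it interpolates at position (n + 1)p and reproduces the Q1 and Q3 values shown above for these data.

```python
import math
import statistics

data = [2.5, 3.2, 0.5, 0.4, 0.3, 0.1, 0.1, 0.2, 7.4, 8.6, 0.2, 0.1,
        0.4, 1.8, 0.3, 1.3, 1.4, 11.2, 2.1, 10.1]

n = len(data)
mean = statistics.mean(data)
stdev = statistics.stdev(data)        # sample standard deviation (divisor n - 1)
se_mean = stdev / math.sqrt(n)        # standard error of the mean
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")

print(f"N={n} Mean={mean:.3f} SE Mean={se_mean:.3f} StDev={stdev:.3f}")
print(f"Min={min(data):.3f} Q1={q1:.3f} Median={median:.3f} Q3={q3:.3f} Max={max(data):.3f}")
```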


Describing Samples Using Minitab Windows 


1. Enter data under C1 in the WORKSHEET. Click on STAT, then on BASIC
STATISTICS, then on DISPLAY DESCRIPTIVE STATISTICS.
2. Type C1 in VARIABLES box; click on OK.


Percentiles of chi square, t, and F distributions can be obtained using the
INVCDF command introduced in Appendix 3.A.1. Figure 7.A.1.2 shows the syntax
for printing out $\chi^2_{.95,6}$ (= 12.5916) and $F_{.01,4,7}$ (= 0.0667746).

Figure 7.A.1.2


Figure 7.A.1.3 


Figure 7.A.1.4 


MTB > invcdf 0.95; 
SUBC > chisq 6. 


Inverse Cumulative Distribution Function 


Chi-Square with 6 DF 


P(X <= x) x 
0.95 12.5916 


MTB > invcdf 0.01; 
SUBC> f 4 7. 


Inverse Cumulative Distribution Function 


F distribution with 4 DF in numerator and 7 DF in denominator 


P(X <= x) x 
0.01 0.0667746 


To find Student t cutoffs, the $t_{\alpha,n-1}$ notation needs to be expressed as a
percentile. We have defined $t_{.10,13}$, for example, to be the value for which

\[ P(T_{13} \ge t_{.10,13}) = 0.10 \]

In the terminology of the INVCDF command, though, $t_{.10,13}$ (= 1.35017) is the
ninetieth percentile of the $f_{T_{13}}(t)$ pdf (see Figure 7.A.1.3).


MTB > invcdf 0.90; 
SUBC> t 13. 


Inverse Cumulative Distribution Function 


Student’s t distribution with 13 DF 


P(X <= x) x 
0.9 1.35017 
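
The three percentiles printed by INVCDF can be checked without statistical software. The sketch below is one simple way to do it, not a description of Minitab's method: each cdf is obtained by integrating the pdf with Simpson's rule, and the percentile is then located by bisection. All function names, brackets, and tolerances are illustrative assumptions, and the pdf guards at zero assume more than two (numerator) degrees of freedom.

```python
import math

def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def chisq_pdf(x, df):
    if x <= 0.0:
        return 0.0
    return math.exp((df / 2 - 1) * math.log(x) - x / 2
                    - (df / 2) * math.log(2) - math.lgamma(df / 2))

def f_pdf(x, d1, d2):
    if x <= 0.0:
        return 0.0
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    return math.exp((d1 / 2) * math.log(d1 / d2) + (d1 / 2 - 1) * math.log(x)
                    - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2) - log_beta)

def t_pdf(x, df):
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - ((df + 1) / 2) * math.log(1 + x * x / df))

def invcdf(p, cdf, lo, hi, tol=1e-7):
    """Bisection for the x in [lo, hi] with cdf(x) = p, for an increasing cdf."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

chisq_95_6 = invcdf(0.95, lambda x: simpson(lambda u: chisq_pdf(u, 6), 0.0, x), 0.0, 100.0)
f_01_4_7 = invcdf(0.01, lambda x: simpson(lambda u: f_pdf(u, 4, 7), 0.0, x), 0.0, 50.0)
# The t pdf is symmetric about 0, so the cdf at x >= 0 is 1/2 plus the area on [0, x].
t_90_13 = invcdf(0.90, lambda x: 0.5 + simpson(lambda u: t_pdf(u, 13), 0.0, x), 0.0, 50.0)

print(round(chisq_95_6, 4), round(f_01_4_7, 7), round(t_90_13, 5))
```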


The Minitab command for constructing a confidence interval for $\mu$ (Theorem
7.4.1) is "TINTERVAL X Y," where X denotes the desired value for the
confidence coefficient $1 - \alpha$ and Y is the column where the data are stored.
Figure 7.A.1.4 shows the TINTERVAL command applied to the bat data from Case
Study 7.4.1; $1 - \alpha$ is taken to be 0.95.


MTB > set c1
DATA > 62 52 68 23 34 45 27 42 83 56 40
DATA > end
MTB > tinterval 0.95 c1


One-Sample T: C1 


Variable N Mean StDev SE Mean 95% CI 
C1 11 48.36 18.08 5.45 (36.21, 60.51) 
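
The TINTERVAL output is just $\bar{y} \pm t_{\alpha/2,n-1} \cdot s/\sqrt{n}$, which the short sketch below reproduces. The cutoff $t_{.025,10} = 2.2281$ is typed in from a Student t table rather than computed, and the variable names are illustrative.

```python
import math
import statistics

bats = [62, 52, 68, 23, 34, 45, 27, 42, 83, 56, 40]
t_cut = 2.2281  # t_{.025,10}, taken from a Student t table

n = len(bats)
mean = statistics.mean(bats)
stdev = statistics.stdev(bats)
se_mean = stdev / math.sqrt(n)
lower, upper = mean - t_cut * se_mean, mean + t_cut * se_mean

print(f"N={n} Mean={mean:.2f} StDev={stdev:.2f} SE Mean={se_mean:.2f} "
      f"95% CI=({lower:.2f}, {upper:.2f})")
```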


Constructing Confidence Intervals Using Minitab Windows 


1. Enter data under C1 in the WORKSHEET.
2. Click on STAT, then on BASIC STATISTICS, then on 1-SAMPLE T.
3. Enter C1 in the SAMPLES IN COLUMNS box, click on OPTIONS, and enter
the value of $100(1 - \alpha)$ in the CONFIDENCE LEVEL box.
4. Click on OK. Click on OK.


Figure 7.A.1.5 shows the input and output for doing a t test on the approval data
given in Table 7.4.3. The basic command is "TTEST X Y," where X is the value of
$\mu_0$ and Y is the column where the data are stored. If no other punctuation is used,


Figure 7.A.1.5 
MTB > set c1
DATA > 59 65 69 53 60 53 58 64 46 67 51 59
DATA > end
MTB > ttest 62 c1


One-Sample T: C1 


Test of mu = 62 vs not = 62 


Variable N Mean StDev SE Mean 95% CI T P 
C1 12 58.66 6.95 2.01 (54.25,63.08) -1.66 0.125 


the program automatically takes $H_1$ to be two-sided. If a one-sided test to the right
is desired, we write


MTB > ttest X Y; 
SUBC > alternative +1. 


For a one-sided test to the left, the subcommand becomes "alternative -1".

Notice that no value for $\alpha$ is entered, and that the conclusion is not phrased as
either "Accept $H_0$" or "Reject $H_0$." Rather, the analysis ends with the calculation of
the data's P-value.

Here,

\[
\begin{aligned}
\text{P-value} &= P(T_{11} \le -1.66) + P(T_{11} \ge 1.66) \\
&= 0.0626 + 0.0626 \\
&= 0.125
\end{aligned}
\]

(recall Definition 6.2.4). Since the P-value exceeds the intended $\alpha$ (= 0.05), the
conclusion is "Fail to reject $H_0$."
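
The t ratio and two-sided P-value in the TTEST output can be reproduced along the following lines. This is a sketch under stated assumptions: the tail area is found by integrating the Student t pdf with Simpson's rule (an illustrative choice, not Minitab's algorithm), and the helper names are hypothetical.

```python
import math
import statistics

def t_pdf(x, df):
    """Student t pdf with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - ((df + 1) / 2) * math.log(1 + x * x / df))

def upper_tail(t, df, steps=2000):
    """P(T_df >= t) for t >= 0: one half minus the Simpson's-rule area under the pdf on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3.0

approval = [59, 65, 69, 53, 60, 53, 58, 64, 46, 67, 51, 59]
mu_0 = 62

n = len(approval)
se_mean = statistics.stdev(approval) / math.sqrt(n)
t_ratio = (statistics.mean(approval) - mu_0) / se_mean
p_value = 2 * upper_tail(abs(t_ratio), n - 1)   # two-sided alternative

print(f"T = {t_ratio:.2f}, P = {p_value:.3f}")
```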
Testing $H_0: \mu = \mu_0$ Using Minitab Windows

1. Enter data under C1 in the WORKSHEET.
2. Click on STAT, then on BASIC STATISTICS, then on 1-SAMPLE T.
3. Type C1 in SAMPLES IN COLUMNS box; click on PERFORM HYPOTHESIS
TEST and enter the value of $\mu_0$. Click on OPTIONS, then choose NOT
EQUAL.
4. Click on OK; then click on OK.


Appendix 7.A.2 Some Distribution Results for $\bar{Y}$ and $S^2$


Theorem 
7.A.2.1 


Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size n from a normal distribution with mean
$\mu$ and variance $\sigma^2$. Define

\[ \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i \quad \text{and} \quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2 \]

Then

a. $\bar{Y}$ and $S^2$ are independent.
b. $\frac{(n-1)S^2}{\sigma^2}$ has a chi square distribution with n - 1 degrees of freedom.

Proof The proof of this theorem relies on certain linear algebra techniques as well 
as a change-of-variables formula for multiple integrals. Definition 7.A.2.1 and the 
Lemma that follows review the necessary background results. For further details, 
see (44) or (213). 


424 Chapter 7 Inferences Based on the Normal Distribution 


Definition 7.A.2.1.

a. A matrix A is said to be orthogonal if $AA^T = I$.

b. Let $\beta$ be any n-dimensional vector over the real numbers. That is, $\beta =
(c_1, c_2, \ldots, c_n)$, where each $c_i$ is a real number. The length of $\beta$ will be
defined as

\[ \|\beta\| = (c_1^2 + \cdots + c_n^2)^{1/2} \]

(Note that $\|\beta\|^2 = \beta\beta^T$.)


Lemma a. A matrix A is orthogonal if and only if

\[ \|A\beta\| = \|\beta\| \quad \text{for each } \beta \]

b. If a matrix A is orthogonal, then $\det A = \pm 1$.
c. Let g be a one-to-one continuous mapping on a subset, D, of n-space. Then

\[ \int_{g(D)} f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n = \int_{D} f[g(y_1, \ldots, y_n)]\, |\det J(g)|\, dy_1 \cdots dy_n \]

where J(g) is the Jacobian of the transformation.


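Set $X_i = (Y_i - \mu)/\sigma$ for i = 1, 2, ..., n. Then all the $X_i$'s are N(0, 1). Let A be an
n x n orthogonal matrix whose last row is $\left(\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}}\right)$. Let $X = (X_1, \ldots, X_n)^T$
and define $Z = (Z_1, Z_2, \ldots, Z_n)^T$ by the transformation Z = AX. [Note that $Z_n =
\left(\frac{1}{\sqrt{n}}\right)X_1 + \cdots + \left(\frac{1}{\sqrt{n}}\right)X_n = \sqrt{n}\,\bar{X}$.]
For any set D,

\[
\begin{aligned}
P(Z \in D) &= P(AX \in D) = P(X \in A^{-1}D) \\
&= \int_{A^{-1}D} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\, dx_1 \cdots dx_n \\
&= \int_{D} f_{X_1,\ldots,X_n}[g(z)]\, |\det J(g)|\, dz_1 \cdots dz_n \\
&= \int_{D} f_{X_1,\ldots,X_n}(A^{-1}z) \cdot 1 \cdot dz_1 \cdots dz_n
\end{aligned}
\]

where $g(z) = A^{-1}z$. But $A^{-1}$ is orthogonal, so setting $(x_1, \ldots, x_n)^T = A^{-1}z$, we have
that

\[ x_1^2 + x_2^2 + \cdots + x_n^2 = z_1^2 + z_2^2 + \cdots + z_n^2 \]

Thus

\[
\begin{aligned}
f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) &= (2\pi)^{-n/2} e^{-(1/2)(x_1^2 + \cdots + x_n^2)} \\
&= (2\pi)^{-n/2} e^{-(1/2)(z_1^2 + \cdots + z_n^2)}
\end{aligned}
\]

From this we conclude that

\[ P(Z \in D) = \int_{D} (2\pi)^{-n/2} e^{-(1/2)(z_1^2 + \cdots + z_n^2)}\, dz_1 \cdots dz_n \]

implying that the $Z_j$'s are independent standard normals.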
Finally,

\[ \sum_{j=1}^{n-1} Z_j^2 + n\bar{X}^2 = \sum_{j=1}^{n} Z_j^2 = \sum_{j=1}^{n} X_j^2 = \sum_{j=1}^{n} (X_j - \bar{X})^2 + n\bar{X}^2 \]

Therefore,

\[ \sum_{j=1}^{n-1} Z_j^2 = \sum_{j=1}^{n} (X_j - \bar{X})^2 \]

and $\bar{X}^2$ (and thus $\bar{X}$) is independent of $\sum_{j=1}^{n} (X_j - \bar{X})^2$, so the conclusion
follows for standard normal variables. Also, since $\bar{Y} = \sigma\bar{X} + \mu$ and
$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sigma^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$, the conclusion follows for $N(\mu, \sigma^2)$ variables.


Comment As part of the proof just presented, we established a version of Fisher's
lemma:


Let $X_1, X_2, \ldots, X_n$ be independent standard normal random variables and let A
be an orthogonal matrix. Define $(Z_1, \ldots, Z_n)^T = A(X_1, \ldots, X_n)^T$. Then the $Z_i$'s are
independent standard normal random variables.
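
The algebra in the proof can be checked numerically for one concrete orthogonal matrix. The sketch below uses a Helmert-type matrix, which is only one of many orthogonal matrices whose last row is $(1/\sqrt{n}, \ldots, 1/\sqrt{n})$; the proof does not single out any particular one. The vector x holds arbitrary illustrative numbers, not data from the text.

```python
import math

def helmert(n):
    """An n x n orthogonal matrix whose last row is (1/sqrt(n), ..., 1/sqrt(n))."""
    rows = []
    for k in range(1, n):
        scale = 1.0 / math.sqrt(k * (k + 1))
        rows.append([scale] * k + [-k * scale] + [0.0] * (n - k - 1))
    rows.append([1.0 / math.sqrt(n)] * n)
    return rows

x = [0.3, -1.2, 2.1, 0.4, -0.7]   # arbitrary illustrative values
n = len(x)
A = helmert(n)
z = [sum(a * xi for a, xi in zip(row, x)) for row in A]   # z = A x

x_bar = sum(x) / n
lhs = sum(zj ** 2 for zj in z[:-1])          # z_1^2 + ... + z_{n-1}^2
rhs = sum((xi - x_bar) ** 2 for xi in x)     # sum of squared deviations

print(z[-1], math.sqrt(n) * x_bar)   # last coordinate equals sqrt(n) times the mean
print(lhs, rhs)                      # the two sums of squares agree
```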


Appendix 7.A.3 A Proof that the One-Sample t Test is a GLRT 


Theorem 
7.A.3.1 


The one-sample t test, as outlined in Theorem 7.4.2, is a GLRT. 


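Proof Consider the test of $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$. The two parameter spaces
restricted to $H_0$ and $H_0 \cup H_1$ (that is, $\omega$ and $\Omega$, respectively) are given by

\[ \omega = \{(\mu, \sigma^2): \mu = \mu_0;\ 0 < \sigma^2 < \infty\} \]

and

\[ \Omega = \{(\mu, \sigma^2): -\infty < \mu < \infty;\ 0 < \sigma^2 < \infty\} \]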


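Without elaborating the details (see Example 5.2.4 for a very similar problem), it
can be readily shown that, under $\omega$,

\[ \mu_e = \mu_0 \quad \text{and} \quad \sigma_e^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \mu_0)^2 \]

Under $\Omega$,

\[ \mu_e = \bar{y} \quad \text{and} \quad \sigma_e^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 \]

Therefore, since

\[ L(\mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{y_i - \mu}{\sigma}\right)^2\right] \]

direct substitution gives

\[ L(\omega_e) = \left[\frac{\sqrt{n}\, e^{-1/2}}{\sqrt{2\pi}\sqrt{\sum_{i=1}^{n} (y_i - \mu_0)^2}}\right]^{n} = \left[\frac{n e^{-1}}{2\pi \sum_{i=1}^{n} (y_i - \mu_0)^2}\right]^{n/2} \]

and

\[ L(\Omega_e) = \left[\frac{n e^{-1}}{2\pi \sum_{i=1}^{n} (y_i - \bar{y})^2}\right]^{n/2} \]

From $L(\omega_e)$ and $L(\Omega_e)$ we get the likelihood ratio:

\[ \lambda = \frac{L(\omega_e)}{L(\Omega_e)} = \left[\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \mu_0)^2}\right]^{n/2}, \quad 0 < \lambda \le 1 \]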


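As is often the case, it will prove to be more convenient to base a test on a monotonic
function of $\lambda$, rather than on $\lambda$ itself. We begin by rewriting the ratio's denominator:

\[
\begin{aligned}
\sum_{i=1}^{n} (y_i - \mu_0)^2 &= \sum_{i=1}^{n} [(y_i - \bar{y}) + (\bar{y} - \mu_0)]^2 \\
&= \sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{y} - \mu_0)^2
\end{aligned}
\]

Therefore,

\[
\begin{aligned}
\lambda &= \left[1 + \frac{n(\bar{y} - \mu_0)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\right]^{-n/2} \\
&= \left(1 + \frac{t^2}{n-1}\right)^{-n/2}
\end{aligned}
\]

where

\[ t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}} \]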

Observe that as $t^2$ increases, $\lambda$ decreases. This implies that the original GLRT
(which, by definition, would have rejected $H_0$ for any $\lambda$ that was too small, say, less
than $\lambda^*$) is equivalent to a test that rejects $H_0$ whenever $t^2$ is too large. But t is an
observation of the random variable

\[ T = \frac{\bar{Y} - \mu_0}{S/\sqrt{n}} \quad (= T_{n-1} \text{ by Theorem 7.3.5}) \]

Thus "too large" translates numerically into $t_{\alpha/2,n-1}$:

\[ 0 < \lambda \le \lambda^* \iff t^2 \ge (t_{\alpha/2,n-1})^2 \]

But

\[ t^2 \ge (t_{\alpha/2,n-1})^2 \iff t \le -t_{\alpha/2,n-1} \ \text{ or } \ t \ge t_{\alpha/2,n-1} \]

and the theorem is proved.
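
The identity linking $\lambda$ and t is easy to confirm numerically. The sketch below evaluates the likelihood ratio twice, once directly from the two sums of squares and once from the t ratio, using the approval data of Figure 7.A.1.5 with $\mu_0 = 62$ purely as a convenient illustration; the variable names are hypothetical.

```python
import math

y = [59, 65, 69, 53, 60, 53, 58, 64, 46, 67, 51, 59]
mu_0 = 62

n = len(y)
y_bar = sum(y) / n
ss_about_mean = sum((yi - y_bar) ** 2 for yi in y)   # sum of (y_i - y_bar)^2
ss_about_mu0 = sum((yi - mu_0) ** 2 for yi in y)     # sum of (y_i - mu_0)^2

# Likelihood ratio computed directly from the two sums of squares
lam_direct = (ss_about_mean / ss_about_mu0) ** (n / 2)

# The same ratio written as a function of the t ratio
s = math.sqrt(ss_about_mean / (n - 1))
t = (y_bar - mu_0) / (s / math.sqrt(n))
lam_from_t = (1 + t ** 2 / (n - 1)) ** (-n / 2)

print(lam_direct, lam_from_t)
```

Both expressions return the same number, and moving $\mu_0$ farther from $\bar{y}$ makes $t^2$ larger and $\lambda$ smaller, which is the monotonic relationship the proof relies on.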
Appendix 7.A.4 A Proof of Theorem 7.5.2


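We begin by considering the test of $H_0: \sigma^2 = \sigma_0^2$ against a two-sided $H_1$. The relevant
parameter spaces are

\[ \omega = \{(\mu, \sigma^2): -\infty < \mu < \infty,\ \sigma^2 = \sigma_0^2\} \]

and

\[ \Omega = \{(\mu, \sigma^2): -\infty < \mu < \infty,\ 0 < \sigma^2\} \]

In both, the maximum likelihood estimate for $\mu$ is $\bar{y}$. In $\omega$, the maximum likelihood
estimate for $\sigma^2$ is simply $\sigma_0^2$; in $\Omega$, $\sigma_e^2 = (1/n)\sum_{i=1}^{n} (y_i - \bar{y})^2$ (see Example 5.4.4).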


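Therefore, the two likelihood functions, maximized over $\omega$ and over $\Omega$, are

\[ L(\omega_e) = \left(\frac{1}{2\pi\sigma_0^2}\right)^{n/2} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{y_i - \bar{y}}{\sigma_0}\right)^2\right] \]

and

\[
\begin{aligned}
L(\Omega_e) &= \left[\frac{n}{2\pi\sum_{i=1}^{n} (y_i - \bar{y})^2}\right]^{n/2} \exp\left[-\frac{n}{2}\cdot\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\right] \\
&= \left[\frac{n}{2\pi\sum_{i=1}^{n} (y_i - \bar{y})^2}\right]^{n/2} e^{-n/2}
\end{aligned}
\]

It follows that the generalized likelihood ratio is given by

\[
\begin{aligned}
\lambda = \frac{L(\omega_e)}{L(\Omega_e)} &= \left[\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n\sigma_0^2}\right]^{n/2} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{y_i - \bar{y}}{\sigma_0}\right)^2 + \frac{n}{2}\right] \\
&= \left(\frac{\sigma_e^2}{\sigma_0^2}\right)^{n/2} e^{-(n/2)(\sigma_e^2/\sigma_0^2) + n/2}
\end{aligned}
\]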

We need to know the behavior of $\lambda$, considered as a function of $(\sigma_e^2/\sigma_0^2)$. For
simplicity, let $x = (\sigma_e^2/\sigma_0^2)$. Then $\lambda = x^{n/2}e^{-(n/2)x + n/2}$ and the inequality $\lambda \le \lambda^*$ is
equivalent to $xe^{-x} \le e^{-1}(\lambda^*)^{2/n}$. The right-hand side is again an arbitrary constant, say,
$k^*$. Figure 7.A.4.1 is a graph of $y = xe^{-x}$. Notice that the values of $x = (\sigma_e^2/\sigma_0^2)$ for
which $xe^{-x} \le k^*$, and equivalently $\lambda \le \lambda^*$, fall into two regions, one for values of
$\sigma_e^2/\sigma_0^2$ close to 0 and the other for values of $\sigma_e^2/\sigma_0^2$ much larger than 1. According
to the likelihood ratio principle, we should reject $H_0$ for any $\lambda \le \lambda^*$, where
$P(\Lambda \le \lambda^* \mid H_0) = \alpha$. But $\lambda^*$ determines (via $k^*$) numbers a and b, so the critical region
is $C = \{(\sigma_e^2/\sigma_0^2): (\sigma_e^2/\sigma_0^2) \le a \text{ or } (\sigma_e^2/\sigma_0^2) \ge b\}$.

Figure 7.A.4.1
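
To see how a cutoff $\lambda^*$ produces the two numbers a and b, the equation $xe^{-x} = k^*$ can be solved on each side of x = 1, where the curve peaks at height $e^{-1}$. The sketch below does this by bisection. The values $\lambda^* = 0.10$ and n = 20 are hypothetical inputs chosen only for illustration, and the function names are not from the text.

```python
import math

def g(x):
    """The function y = x * exp(-x) graphed in Figure 7.A.4.1."""
    return x * math.exp(-x)

def bisect(func, lo, hi, iterations=200):
    """Root of func on [lo, hi], assuming func(lo) and func(hi) differ in sign."""
    f_lo = func(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def critical_ratios(lam_star, n):
    """Return (a, b, k_star) with g(a) = g(b) = k_star and a < 1 < b; needs 0 < lam_star < 1."""
    k_star = math.exp(-1.0) * lam_star ** (2.0 / n)
    a = bisect(lambda x: g(x) - k_star, 0.0, 1.0)   # g increases on (0, 1)
    hi = 2.0
    while g(hi) >= k_star:                          # widen until the right-hand root is bracketed
        hi *= 2.0
    b = bisect(lambda x: g(x) - k_star, 1.0, hi)    # g decreases on (1, infinity)
    return a, b, k_star

a, b, k_star = critical_ratios(lam_star=0.10, n=20)
print(round(a, 4), round(b, 4), round(k_star, 4))
```

Values of $\sigma_e^2/\sigma_0^2$ at or below a, or at or above b, are exactly those with $xe^{-x} \le k^*$, which is the two-piece critical region C described above.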


Comment At this point it is necessary to make a slight approximation. Just because
$P(\Lambda \le \lambda^* \mid H_0) = \alpha$, it does not follow that

\[ P\left[\frac{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{\sigma_0^2} \le a\right] = \frac{\alpha}{2} = P\left[\frac{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{\sigma_0^2} \ge b\right] \]

and, in fact, the two tails of the critical regions will not have exactly the same
probability. Nevertheless, the two are numerically close enough that we will not
substantially compromise the likelihood ratio criterion by setting each one equal
to $\alpha/2$.


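Note that

\[
\begin{aligned}
P\left[\frac{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{\sigma_0^2} \le a\right] &= P\left[\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{\sigma_0^2} \le na\right] \\
&= P\left[\frac{(n-1)S^2}{\sigma_0^2} \le na\right] \\
&= P(\chi^2_{n-1} \le na)
\end{aligned}
\]

and, similarly,

\[ P\left[\frac{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{\sigma_0^2} \ge b\right] = P(\chi^2_{n-1} \ge nb) \]

Thus we will choose as critical values $\chi^2_{\alpha/2,n-1}$ and $\chi^2_{1-\alpha/2,n-1}$ and reject $H_0$ if either

\[ \frac{(n-1)s^2}{\sigma_0^2} \le \chi^2_{\alpha/2,n-1} \]

or

\[ \frac{(n-1)s^2}{\sigma_0^2} \ge \chi^2_{1-\alpha/2,n-1} \]

(see Figure 7.A.4.2).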
Figure 7.A.4.2 (density of the $\chi^2_{n-1}$ distribution, with area $\alpha/2$ in each tail; $H_0$ is rejected to the left of $\chi^2_{\alpha/2,n-1}$ and to the right of $\chi^2_{1-\alpha/2,n-1}$)


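Comment One-sided tests for dispersion are set up in a similar fashion. In the
case of

\[ H_0: \sigma^2 = \sigma_0^2 \quad \text{versus} \quad H_1: \sigma^2 < \sigma_0^2 \]

$H_0$ is rejected if

\[ \frac{(n-1)s^2}{\sigma_0^2} \le \chi^2_{\alpha,n-1} \]

For

\[ H_0: \sigma^2 = \sigma_0^2 \quad \text{versus} \quad H_1: \sigma^2 > \sigma_0^2 \]

$H_0$ is rejected if

\[ \frac{(n-1)s^2}{\sigma_0^2} \ge \chi^2_{1-\alpha,n-1} \]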
TYPES OF DATA: A BRIEF OVERVIEW


8.1 Introduction
8.2 Classifying Data
8.3 Taking a Second Look at Statistics (Samples Are Not "Valid"!)


The practice of statistics is typically conducted on two distinct levels. Analyzing data 
requires first and foremost an understanding of random variables. Which pdfs are 
modeling the observations? What parameters are involved, and how should they be 
estimated? Broader issues, though, need to be addressed as well. How is the entire set 
of measurements configured? Which factors are being investigated; in what ways are 
they related? Altogether, seven different types of data are profiled in Chapter 8. 
Collectively, they represent a sizeable fraction of the “experimental designs” that 
many researchers are likely to encounter. 


8.1 Introduction 


Chapters 6 and 7 have introduced the basic principles of statistical inference. The 
typical objective in that material was either to construct a confidence interval or 
to test the credibility of a null hypothesis. A variety of formulas and decision 
rules were derived to accommodate distinctions in the nature of the data and the 
parameter being investigated. It should not go unnoticed, though, that every set of 
data in those two chapters, despite their superficial differences, shares a critically 
important common denominator—each represents the exact same experimental 
design. 

A working knowledge of statistics requires that the subject be pursued at two 
different levels. On one level, attention needs to be paid to the mathematical prop- 
erties inherent in the individual measurements. These are what might be thought of 
as the “micro” structure of statistics. What is the pdf of the Y;’s? Do we know E(Y;) 
or Var(Y;)? Are the Y;’s independent? 

Viewed collectively, though, every set of measurements also has a certain overall 
structure, or design. It will be those “macro” features that we focus on in this chapter. 
A number of issues need to be addressed. How is one design different from another? 
Under what circumstances is a given design desirable? Or undesirable? How does 
the design of an experiment influence the analysis of that experiment? 

The answers to some of these questions will need to be deferred until each
design is taken up individually and in detail later in the text. For now our objective
is much more limited: Chapter 8 is meant to be a brief introduction to some of the
important ideas involved in the classification of data. What we learn here will serve
as a backdrop and a frame of reference for the multiplicity of statistical procedures
derived in Chapters 9 through 14.


Definitions 


To describe an experimental design, and to distinguish one design from another, 
requires that we understand several key definitions. 


Factors and Factor Levels The word factor is used to denote any treatment or 
therapy “applied to” the subjects being measured or any relevant feature (age, 
sex, ethnicity, etc.) “characteristic” of those subjects. Different versions, extents, or 
aspects of a factor are referred to as levels. 


Case Study 8.1.1 


Generations of athletes have been cautioned that cigarette smoking impedes 
performance. One measure of the truth of that warning is the effect of smoking 
on heart rate. In one study (73), six nonsmokers, six light smokers, six moderate 
smokers, and six heavy smokers each engaged in sustained physical exercise. 
Table 8.1.1 lists their heart rates after they had rested for three minutes. 


Table 8.1.1 Heart Rates

                          Light      Moderate    Heavy
            Nonsmokers    Smokers    Smokers     Smokers

                69          55         66          91
                52          60         81          72
                71          78         70          81
                58          58         77          67
                59          62         57          95
                65          66         79          84
Averages:       62.3        63.2       71.7        81.7
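
One way to hold data of this kind in a program is to key each sample by its factor level. The sketch below does that and recomputes the column averages. The dictionary layout is an illustrative choice, and the sixth value for moderate smokers is taken to be 79, the only value consistent with the printed column average of 71.7.

```python
# Heart rates keyed by factor level (the four levels of the smoking factor)
heart_rates = {
    "Nonsmokers":       [69, 52, 71, 58, 59, 65],
    "Light Smokers":    [55, 60, 78, 58, 62, 66],
    "Moderate Smokers": [66, 81, 70, 77, 57, 79],
    "Heavy Smokers":    [91, 72, 81, 67, 95, 84],
}

# Average heart rate for each level, rounded to one decimal place as in the table
averages = {level: round(sum(rates) / len(rates), 1)
            for level, rates in heart_rates.items()}

for level, average in averages.items():
    print(f"{level:<17} {average}")
```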


The single factor in this experiment is smoking, and its levels are the four 
different column headings in Table 8.1.1. A more elaborate study addressing 
this same concern about smoking could easily be designed to incorporate 
three factors. Common sense tells us that the harmful effects of smoking may 
not be the same for men as they are for women, and they may be more (or 
less) pronounced in senior citizens than they are in young adults. As a factor, 
gender would have two levels, male and female, and age could easily have
at least three (for example, 18-34, 35-64, and 65+). If all three factors were
included, the format of the data table would look like Figure 8.1.1.


                Nonsmokers    Light Smokers    Moderate Smokers    Heavy Smokers
                 M      F       M      F          M       F          M      F
       18-34
Age    35-64
       65+


Figure 8.1.1


Blocks Sometimes subjects or environments share certain characteristics that affect
the way that levels of a factor respond, yet those characteristics are of no intrinsic
interest to the experimenter. Any such set of conditions or subjects is called a
block.


Case Study 8.1.2 


Table 8.1.2 summarizes the results of a rodent-control experiment that was carried
out in Milwaukee, Wisconsin, over a period of ten weeks. The study's single
factor was rat poison flavor, and it had four levels: plain, butter-vanilla, roast
beef, and bread.


Table 8.1.2 Bait-Acceptance Percentages

Survey Number    Plain    Butter-Vanilla    Roast Beef    Bread

      1          13.8         11.7
      2          12.9         16.7
      3          25.9         29.8
      4          18.0         23.1
      5          15.2         20.2

Eight hundred baits of each flavor were placed around garbage-storage 
areas. After two weeks, the percentages of baits taken were recorded. For the 
next two weeks, another set of 3200 baits were placed at a different set of loca- 
tions, and the same protocol was followed. Altogether, five two-week “surveys” 
were completed (85). 

Clearly, each survey created a unique experimental environment. Baits
were placed at different locations, weather conditions would not be the same,
and the availability of other sources of food might change. For those reasons
and maybe others, Survey 3, for example, yielded percentages noticeably higher
than those of Surveys 1 and 2. The experimenters' sole objective, though, was
to compare the four flavors: which did the rodents prefer? The fact that the
survey "environments" were not identical was both anticipated and irrelevant.
The five different surveys, then, qualify as blocks.


About the Data To an applied statistician, the data in Table 8.1.2 would be clas- 
sified as a complete block experiment, because the entire set of factor levels was 
compared within each block. Sometimes physical limitations prevent that from 
being possible, and only subsets of factor levels can appear in a given block. Experi- 
ments of that sort are referred to as incomplete block designs. Not surprisingly, they 
are much more difficult to analyze. 


Independent and Dependent Observations Whatever the context, measure- 
ments collected for the purpose of comparing two or more factor levels are nec- 
essarily either dependent or independent. Two or more observations are dependent 
if they share a particular commonality relevant to what is being measured. If there 
is no such linkage, the observations are independent. 

An example of dependent data is the acceptance percentages recorded in 
Table 8.1.2. The 13.8, for example, shown in the upper-left-hand corner is mea- 
suring both the rodents’ preference for plain baits and also the environmen- 
tal conditions that prevailed in Survey 1; similarly, the observation immediately 
to its right, 11.7, measures the rodents’ preference for the butter-vanilla fla- 
vor and the same survey environmental conditions. By definition, then, 13.8 and 
11.7 are dependent measurements because their values have the commonality 
of sharing the same conditions of Survey 1. Taken together, then, the data in 
Table 8.1.2 are five sets of dependent observations, each set being a sample of 
size 4. 

By way of contrast, the observations in Table 8.1.1 are independent. The 69 and 
55 in the first row, for example, have nothing exceptional in common—they are 
simply measuring the effects of two different factor levels applied to two differ- 
ent people. Would the first two entries in the first column, 69 and 52, be considered 
dependent? No. Simply sharing the same factor level does not make observations 
dependent. 

For reasons that will be examined in detail in later chapters, factor levels can 
often be compared much more efficiently with dependent observations than with 
independent observations. Fortunately, dependent observations come about quite 
naturally in a number of different ways. Measurements made on twins, siblings, 
or littermates are automatically dependent because of the subjects’ shared genetic 
structure (and, of course, repeated measurements taken on the same individual are 
dependent). In agricultural experiments, crops grown in the same general location 
are dependent because they share similar soil quality, drainage, and weather con- 
ditions. Industrial measurements taken with the same piece of equipment or by 
the same operator are likewise dependent. And, of course, time and place (like 
the surveys in Table 8.1.2) are often used to induce shared conditions. Those are 
some of the “standard” ways to make observations dependent. Over the years, 
experimenters have become very adept at finding clever, “nonstandard” ways 
as well. 
Similar and Dissimilar Units Units also play a role in a data set's macrostructure.
Two measurements are said to be similar if their units are the same and dissimilar 
otherwise. Comparing the effects of different factor levels is the typical objective 
when the units in a set of data are all the same. This was the situation in both Case 
Studies 8.1.1 and 8.1.2. Dissimilar measurements are analyzed by quantifying their 
relationship. 


Quantitative Measurements and Qualitative Measurements Measurements are 
considered quantitative if their possible values are numerical. The heart rates in 
Table 8.1.1 and the bait-acceptance percentages in Table 8.1.2 are two examples. 
Qualitative measurements have “values” that are either categories, characteristics, 
or conditions. 


Case Study 8.1.3 


Certain viral infections contracted during pregnancy, particularly early in the
first trimester, can cause birth defects. By far the most dangerous of these are
Rubella infections, also known as German measles. Table 8.1.3 (45) summarizes
the history of 578 pregnancies, each complicated by a Rubella infection either
"early" (first trimester) or "late" (second and third trimesters).


Table 8.1.3

                                 When Infection Occurred
                                   Early         Late
Outcome    Abnormal birth            59            27
           Normal birth             143           349
% of abnormal births               29.2           7.2


Despite all the numbers displayed in Table 8.1.3, these are not quantitative 
measurements. What we are seeing is a summary of qualitative measure- 
ments. When the data were originally recorded, they would have looked like 
Figure 8.1.2. The qualitative time variable had two values (early or late), as did 
the qualitative outcome variable (normal or abnormal). 


Patient no.    Name    Time of Infection    Birth outcome

     1          ML          Early             Abnormal
     2          JG          Late              Normal
     3          DF          Early             Normal
     .           .            .                  .
     .           .            .                  .
   578          CW          Early             Abnormal


Figure 8.1.2 
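
The step from patient-level records like those in Figure 8.1.2 to the counts in Table 8.1.3 is simply a tally. The sketch below is one way to see that in Python. It assumes the individual records can be rebuilt from the published counts; the record layout and the names `table_counts`, `records`, and `percent_abnormal` are illustrative, not part of the original study.

```python
from collections import Counter

# Counts from Table 8.1.3: (time of infection, birth outcome) -> frequency.
table_counts = {
    ("Early", "Abnormal"): 59,
    ("Late", "Abnormal"): 27,
    ("Early", "Normal"): 143,
    ("Late", "Normal"): 349,
}

# Rebuild one qualitative record per pregnancy (hypothetical layout).
records = [pair for pair, count in table_counts.items() for _ in range(count)]

# Tallying the qualitative records recovers the summary table.
tally = Counter(records)


def percent_abnormal(time):
    """Percentage of abnormal births among infections at the given time."""
    abnormal = tally[(time, "Abnormal")]
    total = abnormal + tally[(time, "Normal")]
    return 100 * abnormal / total


print(len(records))                          # 578
print(round(percent_abnormal("Early"), 1))   # 29.2
print(round(percent_abnormal("Late"), 1))    # 7.2
```

The numbers in the table are frequencies of categories, which is why the underlying measurements remain qualitative.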
Possible Designs


The definitions just cited can give rise to a sizeable number of different experimental 
designs, far more than can be covered in this text. Still, the number of designs that 
are widely used is fairly small. Much of the data likely to be encountered fall into 
one of the following seven formats: 


One-sample data 
Two-sample data 
k-sample data 

Paired data 
Randomized block data 
Regression data 
Categorical data 


The heart rates listed in Table 8.1.1, for example, qualify as k-sample data; 
the rodent bait acceptances in Table 8.1.2 are randomized block data; and the 
Rubella/pregnancy outcomes in Table 8.1.3 are categorical data. 

In Section 8.2, each design will be profiled, illustrated, and reduced to a
mathematical model. Special attention will be given to each design’s objectives:
that is, for what type of inference is it likely to be used?


8.2 Classifying Data 


The answers to no more than four questions are needed to classify a set of data as 
one of the seven basic models listed in the preceding section: 


1. Are the observations quantitative or qualitative? 
2. Are the units similar or dissimilar? 

3. How many factor levels are involved? 

4. Are the observations dependent or independent? 


In Section 8.2, we use these four questions as the starting point in distinguishing one 
experimental design from another. 


One-Sample Data 


The simplest of all experimental designs, the one-sample data design, consists of 
a single random sample of size n. Necessarily, the n observations reflect one par- 
ticular set of conditions or one specific factor. During presidential election years, 
a familiar example (probably too familiar...) is the political poll. A random sam- 
ple of n voters, all representing the same demographic group, are asked whether 
they intend to vote for Candidate X—1 for yes, 0 for no. Recorded, then, are the 
outcomes of n Bernoulli trials, where the unknown parameter p is the true propor- 
tion of voters in that particular demographic constituency who intend to support 
Candidate X. 

Other discrete random variables can also appear as one-sample data. Recall 
Case Study 4.2.3, describing the outbreaks of war from 1500 to 1931. Those 432 
observations were shown to follow a Poisson distribution. In practice, though, one- 
sample data will more typically consist of measurements on a continuous random 
variable. In Case Study 4.2.4, the sample of thirty-six intervals between consecutive
eruptions of Mauna Loa had a distribution entirely consistent with an exponential 
random variable. 

All these examples notwithstanding, by far the most frequently encountered
set of assumptions associated with one-sample data is that the $Y_i$’s are a random
sample of size n from a normal distribution with unknown mean $\mu$ and unknown
standard deviation $\sigma$. Possible inference procedures would be either hypothesis
tests or confidence intervals for $\mu$ and/or $\sigma$, whichever would be appropriate for
the experimenter’s objectives.

In describing experimental designs, the assumptions given for a set of
measurements are often written in the form of a model equation, which, by definition,
expresses the value of an arbitrary $Y_i$ as the sum of fixed and variable components.
For one-sample data, the usual model equation is


$$Y_i = \mu + \varepsilon_i, \quad i = 1, 2, \ldots, n$$


where $\varepsilon_i$ is a normally distributed random variable with mean 0 and standard
deviation $\sigma$.
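
One way to get a feel for the one-sample model equation is to simulate it. The following sketch generates observations as a fixed component plus a normal error and then estimates the two parameters from the sample. The values chosen for the mean (50), the standard deviation (5), the sample size, and the seed are made up purely for illustration.

```python
import random
import statistics


def simulate_one_sample(n, mu, sigma, seed=0):
    """Draw Y_i = mu + eps_i, i = 1..n, where eps_i is normal(0, sigma)."""
    rng = random.Random(seed)
    return [mu + rng.gauss(0, sigma) for _ in range(n)]


# mu = 50 and sigma = 5 are hypothetical values, not taken from the text.
y = simulate_one_sample(n=10_000, mu=50.0, sigma=5.0)

y_bar = statistics.mean(y)   # sample mean, an estimate of mu
s = statistics.stdev(y)      # sample standard deviation, an estimate of sigma
print(round(y_bar, 2), round(s, 2))
```

With a sample this large, both estimates should land close to the values used to generate the data.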


Case Study 8.2.1 


Inventions, whether simple or complex, can take a long time to become mar- 
ketable. Minute Rice, for example, was developed in 1931 but appeared for 
the first time on grocery shelves in 1949, some eighteen years later. Listed in 
Table 8.2.1 are the conception dates and realization dates for seventeen familiar 
products (197). Computed for each and shown in the last column is the product’s 
development time, y. In the case of Minute Rice, y = 18 (= 1949 — 1931). 


Table 8.2.1

                           Conception    Realization    Development
Invention                     Date          Date        Time (years)
Automatic transmission        1930          1946             16
Ballpoint pen                 1938          1945              7
Filter cigarettes             1953          1955              2
Frozen foods                  1908          1923             15
Helicopter                    1904          1941             37
Instant coffee                1934          1956             22
Minute Rice                   1931          1949             18
Nylon                         1927          1939             12
Photography                   1782          1838             56
Radar                         1904          1939             35
Roll-on deodorant             1948          1955              7
Telegraph                     1820          1838             18
Television                    1884          1947             63
Transistor                    1940          1956             16
VCR                           1950          1956              6
Xerox copying                 1935          1950             15
Zipper                        1883          1913             30
Average                                                     22.2
About the Data In addition to exhibiting one-sample data, Table 8.2.1 is typical of
the “fun list” format that appears so often in the print media. These are entertain- 
ment data more so than serious scientific research. Here, for example, the average 
development time is 22.2 years. Would it make sense to use that average as part of 
a formal inference procedure? Not really. If it could be assumed that these seven- 
teen inventions were in some sense a random sample of all possible inventions, then 
using 22.2 years to draw an inference about the “true” average development time 
would be legitimate. But the arbitrariness of the inventions included in Table 8.2.1 
makes that assumption highly questionable at best. Data like these are meant to be 
enjoyed and to inform, not to be analyzed. 


Two-Sample Data 


Two-sample data consist of two independent random samples of sizes m and n, each
having quantitative, similar unit measurements. Each sample is associated with a
different factor level. Sometimes the two samples are sequences of Bernoulli
trials, in which case the measurements are 0’s and 1’s. Given that scenario, the data’s
two parameters are the unknown “success” probabilities $p_X$ and $p_Y$, and the usual
inference procedure would be to test $H_0: p_X = p_Y$.

Much more often, the two samples are normally distributed with possibly
different means and possibly different standard deviations. If $X_1, X_2, \ldots, X_n$ denotes the
first sample and $Y_1, Y_2, \ldots, Y_m$ the second, the usual model equation assumptions
would be written


$$X_i = \mu_X + \varepsilon_i, \quad i = 1, 2, \ldots, n$$
$$Y_j = \mu_Y + \varepsilon_j', \quad j = 1, 2, \ldots, m$$


where $\varepsilon_i$ is normally distributed with mean 0 and standard deviation $\sigma_X$, and $\varepsilon_j'$ is
normally distributed with mean 0 and standard deviation $\sigma_Y$.

With two-sample data, inference procedures are more likely to be hypothesis
tests than confidence intervals. A two-sample t test is used to assess the credibility of
$H_0: \mu_X = \mu_Y$; an F test is used when the objective is to choose between $H_0: \sigma_X = \sigma_Y$
and, say, $H_1: \sigma_X \neq \sigma_Y$. Both procedures will be described in Chapter 9.

To experimenters, two-sample data address what is sometimes a serious flaw 
with one-sample data. The usual one-sample hypothesis test, $H_0: \mu = \mu_0$, makes the
tacit assumption that the $Y_i$’s (whose true mean is $\mu$) were collected under the same
conditions that gave rise to the “standard” value $\mu_0$, against which $\mu$ is being tested.
There may be no way to know whether that assumption is true, or even remotely 
true. The two-sample format, on the other hand, lets the experimenter control the 
conditions (and subjects) under which both sets of measurements are taken. Doing 
so heightens the chances that the true means are being compared in a fair and 
equitable way. 


Case Study 8.2.2 


Forensic scientists sometimes have difficulty identifying the sex of a murder 
victim whose body is discovered badly decomposed. Often, dental structure can 
provide useful clues because female teeth and male teeth have different physical
and chemical characteristics. The extent to which X-rays can penetrate tooth
enamel, for instance, is not the same for the two sexes.

Table 8.2.2 lists the enamel spectropenetration gradients for eight male 
teeth and eight female teeth (57). These measurements have all the characteris- 
tics of the two-sample format: the data are quantitative, the units are similar, 
two factor levels (male and female) are involved, and the observations are 
independent. 


Table 8.2.2 Enamel Spectropenetration Gradients

              Male      Female
               4.9        4.8
               5.4        5.3
               5.0        3.7
               5.5        4.1
               5.4        5.6
               6.6        4.0
               6.3        3.6
               4.3        5.0
Averages:      5.4        4.5


The sample averages are 5.4 for the male teeth and 4.5 for the female 
teeth. According to the two-sample t test introduced in Chapter 9, the difference 
between those two sample means is, indeed, statistically significant. 
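
The comparison of the two sample means can be sketched numerically. The code below computes a two-sample t statistic for the gradients in Table 8.2.2, assuming the pooled form in which both samples share a common standard deviation; whether that is the exact variant developed in Chapter 9 is an assumption here. The function name `pooled_t` is illustrative, and the cutoff 2.1448 is the tabled upper 2.5% point of a t distribution with 14 degrees of freedom.

```python
import math
import statistics

male = [4.9, 5.4, 5.0, 5.5, 5.4, 6.6, 6.3, 4.3]
female = [4.8, 5.3, 3.7, 4.1, 5.6, 4.0, 3.6, 5.0]


def pooled_t(x, y):
    """Two-sample t statistic, assuming equal standard deviations."""
    n, m = len(x), len(y)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n - 1) * statistics.variance(x)
           + (m - 1) * statistics.variance(y)) / (n + m - 2)
    diff = statistics.mean(x) - statistics.mean(y)
    return diff / math.sqrt(sp2 * (1 / n + 1 / m)), n + m - 2


t_stat, df = pooled_t(male, female)
T_CRIT_14 = 2.1448  # tabled value for a two-sided test at the 0.05 level, 14 df

print(round(statistics.mean(male), 1), round(statistics.mean(female), 1))  # 5.4 4.5
print(round(t_stat, 2), df)                                                # 2.43 14
print(abs(t_stat) > T_CRIT_14)                                             # True
```

Under these assumptions the observed statistic exceeds the cutoff, which agrees with the statement that the difference is statistically significant.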


About the Data In analyzing these data, the assumption would be made that the 
male gradients ($X_i$’s) and the female gradients ($Y_i$’s) are normally distributed. How
do we know if that assumption is correct? We don’t. For large data sets—sample 
sizes of 30 or more—the assumption that observations are normally distributed can 
be investigated using a goodness-of-fit test, the details of which are presented in 
Chapter 10. For small samples like those in Table 8.2.2, the best that we can do is 
to plot the data along a horizontal line and see if the spacing is consistent with the 
shape of a normal curve. That is, does the pattern show signs of symmetry and is the 
bulk of the data near the center of the range? 


Figure 8.2.1 [Dot plots of the male gradients and the female gradients, each drawn
along a horizontal axis running from 3.0 to 7.0]


Figure 8.2.1 shows two such graphs for the gradients listed in Table 8.2.2. 
By the criteria just mentioned, there is nothing about either sample that would 
be inconsistent with the assumption that both the $X_i$’s and the $Y_i$’s are normally
distributed. 


k-Sample Data 


When more than two factor levels are being compared, and when the observations 
are quantitative, have similar units, and are independent, the measurements are said 
to be k-sample data. Although their assumptions are comparable, two-sample data 
and k-sample data are treated as distinct experimental designs because the ways they 
are analyzed are totally different. The t test format that figures so prominently in the
interpretation of one-sample and two-sample data cannot be extended to accommo- 
date k-sample data. A more powerful technique, the analysis of variance, is needed 
and will be the sole topic of Chapters 12 and 13. 

Their multiplicity of factor levels also requires that k-sample data be identified
using double-subscript notation. The ith observation appearing in the jth factor
level will be denoted $Y_{ij}$, so the model equations take the form


$$Y_{ij} = \mu_j + \varepsilon_{ij}, \quad i = 1, 2, \ldots, n_j; \quad j = 1, 2, \ldots, k$$


where $n_j$ denotes the sample size associated with the jth factor level
($n_1 + n_2 + \cdots + n_k = n$), and $\varepsilon_{ij}$ is a normally distributed random variable
with mean 0 and the same standard deviation $\sigma$ for all i and j.

The first step in analyzing k-sample data is to test $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$.
Procedures are also available for testing subhypotheses involving certain factor levels
irrespective of all the others, in effect fine-tuning the focus of the inferences.


Case Study 8.2.3 


Many studies have been undertaken to document the directional changes over 
time in the Earth’s magnetic field. One approach compared the 1669, 1780, and 
1865 eruptions of Mount Etna. For each seismic event, the magnetic field in the 
resulting molten lava aligned itself with the Earth’s magnetic field as it prevailed 
at that time. When the lava cooled and hardened, the magnetic field was “cap- 
tured” and its direction remained fixed. Table 8.2.3 lists the declinations of the 
magnetic field measured in three blocks of lava, randomly sampled from each 
of those three eruptions (170). 


Table 8.2.3 Declination of Magnetic Field

              In 1669    In 1780    In 1865
                57.8       57.9       52.7
                60.2       55.2       53.0
                60.3       54.8       49.4
Averages:       59.4       56.0       51.7
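
As a preview of how k-sample data are handled, the sketch below computes the three sample averages and the usual one-way analysis of variance F ratio for the declinations in Table 8.2.3. It assumes the standard between-levels over within-levels form of the ratio; the function name `one_way_f` is illustrative, and the formal development belongs to the later chapters.

```python
import statistics

declination = {
    1669: [57.8, 60.2, 60.3],
    1780: [57.9, 55.2, 54.8],
    1865: [52.7, 53.0, 49.4],
}


def one_way_f(groups):
    """F ratio for testing that all factor-level means are equal."""
    all_y = [y for g in groups for y in g]
    grand = statistics.mean(all_y)
    k, n = len(groups), len(all_y)
    # Variation of the level means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    # Variation of the observations around their own level mean.
    ss_within = sum((y - statistics.mean(g)) ** 2 for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


means = {year: round(statistics.mean(v), 1) for year, v in declination.items()}
f_ratio = one_way_f(list(declination.values()))

print(means)               # {1669: 59.4, 1780: 56.0, 1865: 51.7}
print(round(f_ratio, 2))   # 15.28
```

A large ratio indicates that the level means differ by more than the within-level scatter would explain.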
About the Data Every factor in every experiment is said to be either a fixed effect
or a random effect—a fixed effect if the factor’s levels have been preselected by 
the experimenter, a random effect otherwise. Here “time” would be considered a 
random effect because its three levels —1669, 1780, and 1865—were not preselected. 
They were simply the times when the volcano erupted. Whether a factor is fixed or 
random does not affect the analysis of the experimental designs considered in this 
text. For more complicated, multifactor designs, though, the distinction is critical 
and often dictates the way an analysis proceeds. 


Paired Data 


In the two-sample and k-sample designs, factor levels are compared using indepen- 
dent samples. An alternative is to use dependent samples by grouping the subjects 
into n blocks. If only two factor levels are involved, the blocks are referred to as 
pairs, which gives the design its name. 

The responses to factor levels X and Y in the ith pair are recorded as $X_i$ and
$Y_i$, respectively. Whatever contributions to those values are due to the conditions
prevailing in Pair i will be denoted $P_i$. The model equations, then, can be written


$$X_i = \mu_X + P_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

and

$$Y_i = \mu_Y + P_i + \varepsilon_i', \quad i = 1, 2, \ldots, n$$


where $\varepsilon_i$ and $\varepsilon_i'$ are independent normally distributed random variables with mean
0 and the same standard deviation $\sigma$. The fact that $P_i$ is the same for both $X_i$ and $Y_i$
is what makes the samples dependent.

The statistical objective of two-sample data and paired data is often the same.
Both use t tests to focus on the null hypothesis that the true means ($\mu_X$ and $\mu_Y$)
associated with the two factor levels are equal. A paired-data analysis, though, tests
$H_0: \mu_X = \mu_Y$ by defining $\mu_D = \mu_X - \mu_Y$ and testing $H_0: \mu_D = 0$. In effect, a paired
t test is a one-sample t test done on the set of within-pair differences, $d_i = x_i - y_i$,
$i = 1, 2, \ldots, n$.

Some of the more common ways to form paired data have already been men- 
tioned on p. 433. A not-so-common application of one of those ways—time and 
place —is described in Case Study 8.2.4. 


Case Study 8.2.4 


There are many factors that predispose bees to sting (other than sheer orner- 
iness...). A person wearing dark clothing, for instance, is more likely to get 
stung than someone wearing white. And someone whose movements are quick 
and jerky runs a higher risk than does a person who moves more slowly. Still 
another factor—one particularly important to apiarists—is whether or not the 
person has just been stung by other bees. 

The influence of prior stings was simulated in an experiment by dangling 
eight cotton balls wrapped in muslin up and down in front of the entrance to 
a hive (53). Four of the balls had just been exposed to a swarm of angry bees
and were filled with stings; the other four were “fresh.” After a specified length 
of time, the number of new stings in each of the balls was counted. The entire 
procedure was repeated eight more times (see Table 8.2.4). 


Table 8.2.4

Trial    Cotton Balls Previously Stung    Fresh Cotton Balls    Difference
  1                   27                          33                −6
  2                    9                           9                 0
  3                   33                          21                12
  4                   33                          15                18
  5                    4                           6                −2
  6                   21                          16                 5
  7                   20                          19                 1
  8                   33                          15                18
  9                   70                          10                60
Average:                                                          11.8


The last column in Table 8.2.4 gives the nine within-pair differences. The 
average of those differences is 11.8. The issue to be resolved (and what we
need the paired t test to tell us) is whether the difference between 11.8 and 0 is
statistically significant.
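
Because a paired t test is a one-sample t test applied to the within-pair differences, the statistic is easy to sketch. The code below recomputes the differences from Table 8.2.4 and forms the ratio of their average to its estimated standard error. The variable names are illustrative, and interpreting the result is left to the formal procedure described later in the text.

```python
import math
import statistics

stung = [27, 9, 33, 33, 4, 21, 20, 33, 70]   # previously stung cotton balls
fresh = [33, 9, 21, 15, 6, 16, 19, 15, 10]   # fresh cotton balls

# Within-pair differences, one per trial.
d = [x - y for x, y in zip(stung, fresh)]

n = len(d)
d_bar = statistics.mean(d)              # average difference
s_d = statistics.stdev(d)               # sample standard deviation of the differences
t_stat = d_bar / (s_d / math.sqrt(n))   # one-sample t statistic for a mean difference of 0

print(d)                    # [-6, 0, 12, 18, -2, 5, 1, 18, 60]
print(round(d_bar, 1))      # 11.8
print(round(t_stat, 2))     # 1.76
```

The statistic has n − 1 = 8 degrees of freedom, since only the nine differences enter the calculation.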


About the Data When two factor levels are to be compared, experimenters often 
have the choice of using either the two-sample format or the paired-data format. 
That would be the case here. If the experimenter dangled previously stung cotton 
balls in front of the hive on, say, nine occasions and did the same with fresh cotton 
balls on nine other occasions, the two samples would be independent, and the data 
set would qualify as a two-sample design. 

Neither design is always better than the other, for a number of reasons detailed 
in Chapter 13. Sometimes, which is likely to be more effective is not obvious. The 
situation described in Case Study 8.2.4, though, is not one of those times! In general, 
the paired-data format is superior when excessive heterogeneity in the experimen- 
tal environment or among the subjects is present. Is that the case here? Definitely. 
Bees have a well-deserved reputation for erratic, Jekyll-and-Hyde-type behavior. 
All sorts of transient factors might conceivably influence their responses to balls 
dangling in front of their hive. The two-sample format would allow all of that 
trial-to-trial variability within the factor levels to obscure the difference between 
the factor levels. That would be a very serious drawback to using a two-sample 
design in this particular context. In contrast, by targeting the within-pair differences, 
the paired-data design effectively eliminates the component $P_i$ that appears in the
model equations:


$$X_i - Y_i = (\mu_X + P_i + \varepsilon_i) - (\mu_Y + P_i + \varepsilon_i') = \mu_X - \mu_Y + \varepsilon_i - \varepsilon_i'$$

In short, the choice of an experimental design here is a no-brainer. The
researchers who conducted this study did exactly what they should have 
done. 


Randomized Block Data 


Randomized block data have the same basic structure as paired data (quantitative
measurements, similar units, and dependent samples); the only difference is that
more than two factor levels are involved in randomized block data. Those additional
factor levels, though, add a degree of complexity that the paired t test is unable to
handle. Like k-sample data, randomized block data require the analysis of variance 
for their interpretation. 

Suppose the data set consists of k factor levels, all of which are applied in each
of b blocks. The model equation for $Y_{ij}$, the observation appearing in the ith block
and receiving the jth factor level, then becomes


$$Y_{ij} = \mu_j + B_i + \varepsilon_{ij}, \quad i = 1, 2, \ldots, b; \quad j = 1, 2, \ldots, k$$


where $\mu_j$ is the true average response associated with the jth factor level, $B_i$ is the
portion of the value of $Y_{ij}$ that can be attributed to the net effect of all the conditions
that characterize Block i, and $\varepsilon_{ij}$ is a normally distributed random variable with
mean 0 and the same standard deviation $\sigma$ for all i and j.


Case Study 8.2.5 


Table 8.2.5 summarizes the results of a randomized block experiment set up 
to investigate the possible effects of “blood doping,” a controversial proce- 
dure whereby athletes are injected with additional red blood cells for the 
purpose of enhancing their performance (15). Six runners were the subjects. 
Each was timed in three ten thousand-meter races: once after receiving extra 
red blood cells, once after being injected with a placebo, and once after receiv- 
ing no treatment whatsoever. Listed are their times (in minutes) to complete 
the race. 


Table 8.2.5

Subject    No Injection    Placebo    Blood Doping
   1          34.03         34.53        33.03
   2          32.85         32.70        31.55
   3          33.50         33.62        32.33
   4          32.52         31.23        31.20
   5          34.15         32.85        32.80
   6          33.77         33.05        33.07


Clearly, the times in a given row are dependent—all three reflect to some 
extent the inherent speed of the subject, regardless of which factor level might 
also be operative. Documenting differences from subject to subject, though, 
would not be the objective for doing this sort of study. If $\mu_1$, $\mu_2$, and $\mu_3$ denote
the true average times characteristic of the no injection, placebo, and blood doping
factor levels, respectively, the experimenter’s first priority would be to test


$$H_0: \mu_1 = \mu_2 = \mu_3.$$
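
The way a randomized block analysis separates factor-level differences from subject-to-subject differences can be sketched as a decomposition of sums of squares. The code below applies it to the times in Table 8.2.5, assuming the usual partition of total variation into factor-level, block, and error components. The function name `randomized_block_anova` is illustrative, and the full procedure is developed later in the text.

```python
times = [  # rows = subjects (blocks); columns = no injection, placebo, blood doping
    [34.03, 34.53, 33.03],
    [32.85, 32.70, 31.55],
    [33.50, 33.62, 32.33],
    [32.52, 31.23, 31.20],
    [34.15, 32.85, 32.80],
    [33.77, 33.05, 33.07],
]


def randomized_block_anova(y):
    """Return factor-level means and the F ratio for the factor levels."""
    b, k = len(y), len(y[0])
    grand = sum(map(sum, y)) / (b * k)
    level_means = [sum(row[j] for row in y) / b for j in range(k)]
    block_means = [sum(row) / k for row in y]
    ss_total = sum((v - grand) ** 2 for row in y for v in row)
    ss_levels = b * sum((m - grand) ** 2 for m in level_means)
    ss_blocks = k * sum((m - grand) ** 2 for m in block_means)
    # Removing the block component keeps subject speed out of the error term.
    ss_error = ss_total - ss_levels - ss_blocks
    f_levels = (ss_levels / (k - 1)) / (ss_error / ((b - 1) * (k - 1)))
    return level_means, f_levels


level_means, f_levels = randomized_block_anova(times)
print([round(m, 2) for m in level_means])   # [33.47, 33.0, 32.33]
print(f_levels)
```

Subtracting the block sum of squares is the k-level counterpart of taking within-pair differences in the paired design.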


About the Data The name randomized block derives from one of the properties 
that such data supposedly have—namely, that the factor levels within each block 
have been applied in a random order. To do otherwise—that is, to take the mea- 
surements in any sort of systematic fashion (however well intentioned) —is to create 
the opportunity for the observations to become biased. If that worst-case scenario 
should happen, the data are worthless because there is no way to separate the “fac- 
tor effect” from the “bias effect” (and, of course, there is no way to know for certain 
whether the data were biased in the first place). 

For the same reasons, two-sample data and k-sample data should be completely 
randomized, which means that the entire set of measurements should be taken in 
a random order. Figure 8.2.2 shows an acceptable measurement sequence for the 
performance times in Table 8.2.5 and the magnetic field declinations in Table 8.2.3. 


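Figure 8.2.2 [Two panels. For Table 8.2.5, each subject's row gives the order (1, 2, 3)
in which the no injection, placebo, and blood doping runs are made. For Table 8.2.3,
the nine lava blocks are numbered 1 through 9 in the order they are measured. The
individual entries are not recoverable here.]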
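
Measurement sequences of the kind just described can be generated by shuffling. The sketch below is one possible approach, not the procedure used in either study: it draws a random order of factor levels within each block for a randomized block design and a single random order over all measurements for a completely randomized k-sample design. The seed and the labels are arbitrary choices made for illustration.

```python
import random

rng = random.Random(2012)  # fixed seed (arbitrary) so the sketch is reproducible

# Randomized block design: shuffle the factor levels separately within each subject.
levels = ["No Injection", "Placebo", "Blood Doping"]
block_orders = {subject: rng.sample(levels, k=len(levels)) for subject in range(1, 7)}

# Completely randomized design: shuffle all nine (eruption, lava block) measurements.
labels = [(year, block) for year in (1669, 1780, 1865) for block in (1, 2, 3)]
run_order = rng.sample(labels, k=len(labels))

for subject, order in block_orders.items():
    print(subject, order)
print(run_order)
```

Each run with a different seed gives a different, equally acceptable sequence.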


Regression Data 


All the experimental designs introduced up to this point share the property that 
their measurements have the same units. Moreover, each has had the same basic 
objective: to quantify or to compare the effects of one or more factor levels. In 
contrast, regression data typically consist of measurements with dissimilar units, and 
the objective with them is to study the functional relationship between the variables. 

Regression data often have the form $(x_i, Y_i)$, $i = 1, 2, \ldots, n$, where $x_i$ is the
value of an independent variable (typically preselected by the experimenter) and
$Y_i$ is a dependent random variable (usually having units different from those of $x_i$).
A particularly important special case is the simple linear model,


$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$


where $\varepsilon_i$ is assumed to be normally distributed with mean 0 and standard
deviation $\sigma$. Here $E(Y_i) = \beta_0 + \beta_1 x_i$, but more generally, $E(Y_i)$ can be any function
$g(x_i)$, for example,


$$E(Y_i) = \beta_0 x_i^{\beta_1} \quad \text{or} \quad E(Y_i) = \beta_0 e^{\beta_1 x_i}$$


The details will not be presented in this text, but the simple linear model can be
extended to include k independent variables. The result is a multiple linear regression
model,


$$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i, \quad i = 1, 2, \ldots, n$$


An important special case of the regression model occurs when the $x_i$’s are
not preselected by the experimenter. Suppose, for example, that the relationship
between height and weight is to be studied in adult males. One way to collect a
set of relevant data would be to choose a random sample of n adult males and
record each subject’s height and weight. Neither variable in that case would be
preselected or controlled by the experimenter: the height, $X_i$, and the weight,
$Y_i$, of the ith subject are both random variables, and the measurements in that
case, $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, are said to be correlation data. The usual
assumption invoked for correlation data is that the (X, Y)’s are jointly distributed
according to a bivariate normal distribution (see Figure 8.2.3).


Figure 8.2.3


The implications of the independent variable being either preselected ($x_i$) or
random ($X_i$) will be explored at length in Chapter 11. Suffice it to say that if the
objective is to summarize the relationship between the two variables with a straight
line, as it is in Figure 8.2.4, it makes absolutely no difference whether the data have
the form $(x_i, Y_i)$ or $(X_i, Y_i)$: the resulting equation will be the same.


Case Study 8.2.6 


One of the most startling and profound scientific revelations of the twentieth 
century was the evidence, discovered in 1929 by the American astronomer 
Edwin Hubble, that the universe is expanding. Hubble’s research shattered 
forever the ancient belief that the heavens are basically in a state of cosmic 
equilibrium: quite the contrary, galaxies are receding from each other at mind- 
bending velocities (the cluster Hydra, for example, is moving away from other 
clusters at the rate of 38.0 thousand miles/sec). 




If y is a galaxy’s recession velocity (relative to that of any other galaxy) and 
x is its distance (from that other galaxy), Hubble’s law states that 
y=Hx 


where H is known as Hubble’s constant. Table 8.2.6 summarizes his findings — 
listed are distance and velocity determinations made for eleven galactic clus- 
ters (23). 


Table 8.2.6

                         Distance, x                  Velocity, y
Cluster             (millions of light-years)    (thousands of miles/sec)
Virgo                         22                         0.75
Pegasus                       68                         2.4
Perseus                      108                         3.2
Coma Berenices               137                         4.7
Ursa Major No. 1             255                         9.3
Leo                          315                        12.0
Corona Borealis              390                        13.4
Gemini                       405                        14.4
Bootes                       685                        24.5
Ursa Major No. 2             700                        26.0
Hydra                       1100                        38.0


For these data, the value H is estimated to be 0.03544 (using a technique 
covered in Chapter 11). Figure 8.2.4 shows that 


y = 0.03544x

fits the data exceptionally well.


Figure 8.2.4 [Scatterplot of velocity, y, against distance, x, for the eleven clusters
in Table 8.2.6, with the fitted line y = 0.03544x]
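
The estimate of H can be reproduced with a short calculation. The sketch below assumes the estimate is the least-squares slope of a line constrained to pass through the origin, which is the sum of the products xy divided by the sum of the squares of x. The text defers the actual technique to Chapter 11, so treating it as this particular estimator is an assumption, and the function name `slope_through_origin` is illustrative.

```python
# Table 8.2.6: distance (millions of light-years), velocity (thousands of miles/sec).
distance = [22, 68, 108, 137, 255, 315, 390, 405, 685, 700, 1100]
velocity = [0.75, 2.4, 3.2, 4.7, 9.3, 12.0, 13.4, 14.4, 24.5, 26.0, 38.0]


def slope_through_origin(x, y):
    """Least-squares slope for the no-intercept model y = H x."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


H = slope_through_origin(distance, velocity)
print(round(H, 5))   # 0.03544
```

Under that assumption the computed slope matches the value quoted for these data.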


About the Data Techniques for measuring interstellar distances have been greatly 
refined since the 1920s when Hubble reported the data in Table 8.2.6. The most 
recent estimates yield a value for Hubble’s constant about a third as large as the 
slope shown in Figure 8.2.4. That particular adjustment is critical because the recip- 
rocal of Hubble’s constant can be used to calculate the age of the universe, or, at 
the very least, the time elapsed since the Big Bang [see (96)]. Based on the revised 
data, the Big Bang occurred some fifteen billion years ago, a number that agrees
with estimates found using other methods. 


Comment Look again at the graph of Hubble’s data in Figure 8.2.4. Which is the 
appropriate description of the eleven (distance, velocity) measurements—are they 
$(x_i, Y_i)$’s or $(X_i, Y_i)$’s? The answer is not obvious. At first glance, these would appear
to be correlation data—distance (X) and velocity (Y) measurements having been 
made jointly on a random sample of eleven galactic clusters. Arguing against that 
conclusion is the spacing of the points. With correlation data, the bulk of the X mea- 
surements lie near the center of their range, which is not the case here. Perhaps 
the reason for the unusual spacing is a set of constraints imposed by the other- 
worldly nature of the data, or maybe it suggests that Hubble, for whatever reasons, 
preselected the clusters because of their distances. 


Categorical Data 


Suppose two qualitative, dissimilar variables are observed on each of n subjects, 
where the first variable has R possible values and the second variable, C possible 
values. We call such measurements categorical data. 

The number of times each value of one variable occurs with each value of the 
other variable is typically displayed in a contingency table, which necessarily has R 
rows and C columns. Whether the two variables are independent is the question that 
an experimenter can use categorical data to answer. 


Case Study 8.2.7 


Is there a relationship between a physician’s malpractice history (X) and his or 
her specialty (Y)? Three “values” of X were looked at, as well as three “values” 
of Y (29): 
    X:  no malpractice claims
        one or more claims ending in damages awarded
        one or more claims filed but none requiring compensation

    Y:  orthopedic surgery
        obstetrics-gynecology
        internal medicine


A total of 1942 physicians comprised the sample. The resulting (X, Y) values 
are summarized in the contingency table shown in Figure 8.2.5. 


                            Orth. Surg.    Ob-Gyn    Int. Med.    Totals
No claims                       147          349        709        1205
At least one claim lost         106          149         62         317
No claims lost                  156          149        115         420
Totals:                         409          647        886        1942


Figure 8.2.5 


The hypotheses to be tested in any categorical-data problem always have
the same form: 


$H_0$: X and Y are independent


versus


$H_1$: X and Y are dependent


The formal procedure for choosing between $H_0$ and $H_1$ is a chi square test, which
will be covered in Chapter 10. A quick look at these data leaves no doubt
that $H_0$ would be overwhelmingly rejected. If X and Y are independent, the
probability that a physician receives, say, no claims, should be the same for all 
three specialties. The sample proportions of no claims, though, are dramatically 
different from specialty to specialty: 


for Orthopedic surgery:  147/409 = 35.9%
for Ob-Gyn:              349/647 = 53.9%
for Internal medicine:   709/886 = 80.0%


Clearly, the variables X and Y are dependent. 
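
The informal comparison of proportions can be backed up with a quick calculation. The sketch below recomputes the no-claims percentages from Figure 8.2.5 and then forms the familiar chi square statistic, the sum over all cells of (observed − expected)² / expected, where each expected count is (row total × column total) / n. Chapter 10 gives the formal test; the function name `chi_square` is illustrative, and 9.488 is the tabled upper 5% point of a chi square distribution with 4 degrees of freedom.

```python
observed = [
    [147, 349, 709],   # no claims
    [106, 149, 62],    # at least one claim lost
    [156, 149, 115],   # no claims lost
]  # columns: orthopedic surgery, ob-gyn, internal medicine


def chi_square(table):
    """Chi square statistic and degrees of freedom for an R x C table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # count expected if X, Y independent
            stat += (obs - expected) ** 2 / expected
    return stat, (len(row_tot) - 1) * (len(col_tot) - 1)


col_totals = [sum(c) for c in zip(*observed)]
no_claims_pct = [round(100 * o / t, 1) for o, t in zip(observed[0], col_totals)]
stat, df = chi_square(observed)
CHI2_CRIT_4 = 9.488   # tabled value, 0.05 level, 4 df

print(no_claims_pct)        # [35.9, 53.9, 80.0]
print(df)                   # 4
print(stat > CHI2_CRIT_4)   # True
```

The statistic is far beyond the cutoff, in line with the conclusion that the two variables are dependent.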


About the Data The categorical-data format “overlaps” the two-sample data 
format for one particular type of measurement. Consider the simplest version of 
categorical data, where both X and Y have only two values. Call the two values of X 
“success” and “failure,” and the two values of Y “Level 1” and “Level 2.” Given a 
sample of n such observations, the corresponding contingency table would look like 
Figure 8.2.6. 


                             Y
                   Level 1    Level 2    Totals
     Success          a          b        a + b
X    Failure          c          d        c + d
     Totals:        a + c      b + d      n = a + b + c + d

Figure 8.2.6


Notice that the “a” and “c” in Column 1 are another way of expressing the num- 
bers of 1’s and 0’s, respectively, in a sequence of a+ c Bernoulli trials. Similarly, the 
“b” and “d” in Column 2 tally up the 1’s and 0’s, respectively, in a second set of 
b+d Bernoulli trials. To say that X and Y are dependent (in the categorical-data 
sense) is to say that the difference between a/(a+c) and b/(b+ d) is statisti- 
cally significant (in the two-sample data sense). The two data models answer their 
respective questions with different statistical tests, but the two procedures (a chi 
square test and a Z test) are equivalent: one will reject $H_0$ if and only if the other
rejects $H_0$.
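
The equivalence of the two procedures can be checked numerically for any 2 × 2 table. The sketch below assumes the Z statistic uses the pooled proportion under the null hypothesis and the chi square statistic is computed without a continuity correction; under those assumptions the chi square value is exactly the square of Z. The counts are borrowed from Table 8.1.3 purely as an illustration, and the function names are not from the text.

```python
import math


def z_statistic(a, b, c, d):
    """Two-sample Z statistic comparing a/(a+c) with b/(b+d), pooled proportion."""
    n1, n2 = a + c, b + d
    p1, p2 = a / n1, b / n2
    p = (a + b) / (n1 + n2)   # pooled estimate of the common success probability
    return (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


def chi_square_2x2(a, b, c, d):
    """Chi square statistic for a 2 x 2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Illustration: "success" = abnormal birth, Level 1 = early, Level 2 = late.
a, b, c, d = 59, 27, 143, 349
z = z_statistic(a, b, c, d)
chi2 = chi_square_2x2(a, b, c, d)

print(round(z ** 2, 2), round(chi2, 2))   # 50.34 50.34
```

Because one statistic is the square of the other, comparing Z with a two-sided normal cutoff and comparing the chi square value with the square of that cutoff lead to the same decision.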


448 Chapter 8 Types of Data: A Brief Overview 


A Flowchart for Classifying Data


Differentiating the seven data formats just discussed are the answers to the
four questions cited at the beginning of this section: Are the data qualitative
or quantitative? Are the units similar or dissimilar? How many factor levels are
involved? Are the samples dependent or independent? The flowchart pictured
in Figure 8.2.7 shows the sequence of responses that leads to each of the seven
models.


Start
  Are the data qualitative or quantitative?
    Qualitative: Categorical data
    Quantitative: Are the units similar or dissimilar?
      Dissimilar: Regression data
      Similar: How many factor levels are involved?
        One: One-sample data
        Two: Are the samples dependent or independent?
          Dependent: Paired data
          Independent: Two-sample data
        More than two: Are the samples dependent or independent?
          Dependent: Randomized block data
          Independent: k-sample data

Figure 8.2.7


Example 8.2.1



The federal Community Reinvestment Act of 1977 was enacted out of concern that 
banks were reluctant to make loans in low- and moderate-income areas, even when 
applicants seemed otherwise acceptable. The figures in Table 8.2.7 show one partic- 
ular bank’s credit penetration in ten low-income census tracts (A through J) and ten 
high-income census tracts (K through T). To which of the seven models do these 
data belong? 

Note, first, that the measurements (1) are quantitative and (2) have similar units. 
Low-income and High-income correspond to two treatment levels, and the two sam- 
ples are clearly independent (the 4.6 recorded in tract A, for example, has nothing 
specific in common with the 11.6 recorded in tract K). From the flowchart, then,
the answers quantitative/similar/two/independent imply that these are two-sample 
data. 
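
The sequence of questions in the flowchart can be sketched as a small decision function. This is only an illustration: the function name `classify_design` and its argument names are hypothetical (they do not come from the text), and the returned strings simply echo the seven model names used in this chapter.

```python
def classify_design(data_type, units=None, levels=None, dependent=None):
    """Walk the four flowchart questions and return the name of the data model.

    data_type : "qualitative" or "quantitative"
    units     : "similar" or "dissimilar" (needed only for quantitative data)
    levels    : number of factor levels (needed only when units are similar)
    dependent : True/False (needed only when there are two or more levels)
    """
    if data_type == "qualitative":
        return "categorical"
    if data_type != "quantitative":
        raise ValueError("data_type must be 'qualitative' or 'quantitative'")

    if units == "dissimilar":
        return "regression"
    if units != "similar":
        raise ValueError("units must be 'similar' or 'dissimilar'")

    if levels is None or levels < 1:
        raise ValueError("levels must be a positive integer")
    if levels == 1:
        return "one-sample"

    if dependent is None:
        raise ValueError("dependent must be True or False")
    if levels == 2:
        return "paired" if dependent else "two-sample"
    return "randomized block" if dependent else "k-sample"


# Example 8.2.1: quantitative / similar / two levels / independent
print(classify_design("quantitative", "similar", 2, False))
```

The answers quantitative/similar/two/independent from Example 8.2.1 lead to the two-sample branch, in agreement with the conclusion reached above.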


Table 8.2.7 

Low-Income      Percent of Households    High-Income     Percent of Households 
Census Tract    with Credit              Census Tract    with Credit 
A               4.6                      K               11.6 
B               6.6                      L               8.5 
C               3.3                      M               8.2 
D               9.8                      N               15.1 
E               6.9                      O               12.6 
F               11.0                     P               11.3 
G               6.0                      Q               9.1 
H               4.6                      R               4.2 
I               4.2                      S               6.4 
J               5.1                      T               5.9 


Example 8.2.2 


Individuals looking at the vertical lines in Figure 8.2.8 will tend to perceive the right 
one as shorter, even though the two are equal. Moreover, the perceived difference 
in those lengths—what psychologists call the “strength” of the illusion—has been 
shown to be a function of age. 


Figure 8.2.8 


A study was done to see whether individuals who are hypnotized and regressed 
to different ages also perceive the illusion differently. Table 8.2.8 shows the illusion 
strengths measured for eight subjects while they were (1) awake, (2) regressed to age 
nine, and (3) regressed to age five (137). Which of the seven experimental designs 
do these data represent? 

Look again at the sequence of questions posed by the flowchart in Figure 8.2.7: 


Are the data qualitative or quantitative? Quantitative 

Are the units similar or dissimilar? Similar 

How many factor levels are involved? More than two 

Are the observations dependent or independent? Dependent 



Table 8.2.8 

            (1)       (2)          (3) 
                      Regressed    Regressed 
Subject     Awake     to Age 9     to Age 5 
1           0.81      0.69         0.56 
2           0.44      0.31         0.44 
3           0.44      0.44         0.44 
4           0.56      0.44         0.44 
5           0.19      0.19         0.31 
6           0.94      0.44         0.69 
7           0.44      0.44         0.44 
8           0.06      0.19         0.19 


According to the flowchart, then, these measurements qualify as randomized block 
data. 


Questions 


For Questions 8.2.1-8.2.12 use the flowchart in Figure 8.2.7 
to identify the experimental designs represented. In each 
case, answer whichever of the questions on p. 435 are 
necessary to make the determination. 


8.2.1. Kepler’s Third Law states that “the squares of the 
periods of the planets are proportional to the cubes of 
their mean distance from the Sun.” Listed below are the 
periods of revolution (x), the mean distances from the sun 
(y), and the values $x^2/y^3$ for the eight planets in the solar 
system (3). 


Planet $x_i$ (years) $y_i$ (astronomical units) $x_i^2/y_i^3$ 
Mercury 0.241 0.387 1.002 
Venus 0.615 0.723 1.001 
Earth 1.000 1.000 1.000 
Mars 1.881 1.524 1.000 
Jupiter 11.86 5.203 0.999 
Saturn 29.46 9.54 1.000 
Uranus 84.01 19.18 1.000 
Neptune 164.8 30.06 1.000 


8.2.2. Mandatory helmet laws for motorcycle riders 
remain a controversial issue. Some states have a “limited” 
ordinance that applies only to younger riders; others have 
a “comprehensive” statute requiring all riders to wear hel- 
mets. Listed in the next column are the deaths per ten 
thousand registered motorcycles reported by states having 
each type of legislation (184). 


Limited Helmet Law Comprehensive Helmet Law 


6.8 7.0 oi 7.1 4.8 7.0 
10.6 4.1 0.5 11.2 5.0 6.8 
9.6 5.7 6.7 17.9 8.1 7.3 
9.1 7.6 6.4 11.3 5.5 12.9 
5.2 3.0 4.7 8.5 11.7 3.7 
13.2 6.7 15.0 9.3 4.0 5.2 
6.9 7.3 4.7 5.4 7.0 6.9 
8.1 4.2 4.8 10.5 9.3 8.6 


8.2.3. Aedes aegypti is the scientific name of the mosquito 
that transmits yellow fever. Although no longer a major 
health problem in the Western world, yellow fever was 
perhaps the most devastating communicable disease in 
the United States for almost two hundred years. To see 
how long it takes the Aedes mosquito to complete a feed- 
ing, five young females were allowed to bite an exposed 
human forearm without the threat of being swatted. The 
resulting blood-sucking times (in seconds) are summa- 
rized below (89). 


Mosquito Bite Duration (sec) 


1 176.0 
2 202.9 
3 315.0 
4 374.6 
5 392.5 


8.2.4. Male cockroaches can be very antagonistic toward 
other male cockroaches. Encounters may be fleeting or 


quite spirited, the latter often resulting in missing anten- 
nae and broken wings. A study was done to see whether 
cockroach density has any effect on the frequency of seri- 
ous altercations. Ten groups of four male cockroaches 
(Byrsotria fumigata) were each subjected to three levels 
of density: high, intermediate, and low. The following are 
the numbers of “serious” encounters per minute that were 
observed (14). 


Group High Intermediate Low 
1 0.30 0.11 0.12 
2 0.20 0.24 0.28 
3 0.17 0.13 0.20 
4 0.25 0.36 0.15 
5 0.27 0.20 0.31 
6 0.19 0.12 0.16 
7 0.27 0.19 0.20 
8 0.23 0.08 0.17 
9 0.37 0.18 0.18 

10 0.29 0.20 0.20 
Averages: 0.25 0.18 0.20 


8.2.5. Luxury suites, many costing more than $100,000 to 
rent, have become big-budget status symbols in new sports 
arenas. Below are the numbers of suites (x) and their 
projected revenues (y) for nine of the country’s newest 
facilities (196). 


                                 Number of      Projected Revenues 
Arena                            Suites, x      (in millions), y 
Palace (Detroit)                 180            $11.0 
Orlando Arena                    26             1.4 
Bradley Center (Milwaukee)       68             3.0 
America West (Phoenix)           88             6.0 
Charlotte Coliseum               12             0.9 
Target Center (Minneapolis)      67             4.0 
Salt Lake City Arena             56             3.5 
Miami Arena                      18             1.4 
ARCO Arena (Sacramento)          30             2.7 


8.2.6. Depth perception is a life-or-death ability for lambs 
inhabiting rugged mountain terrain. How quickly a lamb 
develops that faculty may depend on the amount of time 
it spends with its ewe. Thirteen sets of lamb littermates 
were the subjects of an experiment that addressed that 
question (99). One member of each litter was left with its 
mother; the other was removed immediately after birth. 
Once every hour, the lambs were placed on a simulated 
cliff, part of which included a platform of glass. If a lamb 
placed its feet on the glass, it “failed” the test, since that 
would have been equivalent to walking off the cliff. Below 
are the trial numbers when the lambs first learned not to 
walk on the glass —that is, when they first developed depth 
perception. 


Number of Trials to Learn Depth Perception 

Group Mothered, $x_i$ Unmothered, $y_i$ 


1 2 3 
2 3 11 
3 5 10 
4 3 5 
> 2 3S) 
6 1 4 
f) 1 2 
8 5 7 
9 3 5 
10 1 4 
11 7h 8 
12 3 12 
13 5 7 


8.2.7. To see whether teachers’ expectations for students 
can become self-fulfilling prophecies, fifteen first graders 
were given a standard IQ test. The children’s teachers, 
though, were told it was a special test for predicting 
whether a child would show sudden spurts of intellectual 
growth in the near future (see 147). Researchers divided 
the children into three groups of sizes 6, 5, and 4 at ran- 
dom, but they informed the teachers that, according to the 
test, the children in group I would not demonstrate any 
pronounced intellectual growth for the next year, those in 
group II would develop at a moderate rate, and those in 
group III could be expected to make exceptional progress. 
A year later, the same fifteen children were again given 
a standard IQ test. Below are the differences in the two 
scores for each child (second test — first test). 


Changes in IQ (second test − first test) 

Group I    Group II    Group III 
3          10          20 
2          4           9 
6          11          18 
10         14          19 
10         3 
5 


8.2.8. Among young drivers, roughly a third of all fatal 
automobile accidents are speed-related; by age 60 that 
proportion drops to about one-tenth. Listed below are 
a recent year’s percentages of speed-related fatalities for 
ages ranging from 16 to 72 (189). 


Age Percent Speed-Related Fatalities 
16 37 
17 32 
18 33 
19 34 
20 33 
22 31 
24 28 
27 26 
32 23 
42 16 
52 13 
57 10 
62 9 
72 7 


8.2.9. Gorillas are not the solitary creatures that they are 
often made out to be: they live in groups whose average 
size is about 16, which usually includes three adult males, 
six adult females, and seven “youngsters.” Listed below 
are the sizes of ten groups of mountain gorillas observed 
in the volcanic highlands of the Albert National Park in 
the Congo (157). 


Group No. of Gorillas 


8.2.10. Roughly 360,000 bankruptcies were filed in U.S. 
Federal Court during 1981; by 1990 the annual num- 
ber was more than twice that figure. The following are 
the numbers of business failures reported year by year 
through the 1980s (175). 


Year    Bankruptcies Filed 
1981    360,329 
1982    367,866 
1983    374,734 
1984    344,275 
1985    364,536 
1986    477,856 
1987    561,274 
1988    594,567 
1989    642,993 
1990    726,484 


8.2.11. The diversity of bird species in a given area is 
related to plant diversity, as measured by variation in 
foliage heights as well as the variety of flora. Below are 
indices measured on those two traits for thirteen desert- 
type habitats (109). 


        Plant Cover         Bird Species 
Area    Diversity, $x_i$    Diversity, $y_i$ 
1 0.90 1.80 
2 0.76 1.36 
3 1.67 2.92 
4 1.44 2.61 
5 0.20 0.42 
6 0.16 0.49 
7 1.12 1.90 
8 1.04 2.38 
9 0.48 1.24 
10 1.33 2.80 
11 1.10 2.41 
12 1.56 2.80 
13 1.15 2.16 


8.2.12. Male toads often have trouble distinguishing 
between other male toads and female toads, a state of 
affairs that can lead to awkward moments during mating 
season. When male toad A inadvertently makes inappro- 
priate romantic overtures to male toad B, the latter emits 
a short call known as a release chirp. Below are the lengths 
of the release chirps measured for fifteen male toads 
innocently caught up in misadventures of the heart (17). 


Toad Length of Release Chirp (sec) 


For Questions 8.2.13-8.2.32 identify the experimental 
design (one-sample, two-sample, etc.) that each set of data 
represents. 


8.2.13. A pharmaceutical company is testing two new 
drugs designed to improve the blood-clotting ability of 
hemophiliacs. Six subjects volunteering for the study are 
randomly divided into two groups of size 3. The first group 
is given drug A; the second group, drug B. The response 
variable in each case is the subject’s prothrombin time, a 
number that reflects the time it takes for a clot to form. 
The results (in seconds) for group A are 32.6, 46.7, and 
81.2; for group B, 25.9, 33.6, and 35.1. 


8.2.14. Investment firms financing the construction of 
new shopping centers pay close attention to the amount 
of retail floor space already available. Listed below are 
population and floor space figures for five southern cities. 


                          Retail Floor Space 
City    Population, x     (in million square meters), y 
1 400,000 3,450 
2 150,000 1,825 
3 1,250,000 7,480 
4 2,975,000 14,260 
5 760,000 5,290 


8.2.15. Nine political writers were asked to assess the 
United States’ culpability in murders committed by rev- 
olutionary groups financed by the CIA. Scores were 
assigned using a scale of 0 to 100. Three of the writers were 
native Americans living in the United States, three were 
native Americans living abroad, and three were foreign 
nationals. 


Americans in U.S.    Americans Abroad    Foreign Nationals 
45                   65                  75 
45                   50                  90 
40                   55                  85 


8.2.16. To see whether low-priced homes are easier to sell 
than moderately priced homes, a national realty company 
collected the following information on the lengths of times 
homes were on the market before being sold. 


Number of Days on Market 
City Low-Priced Moderately Priced 
Buffalo 55 70 
Charlotte 40 30 
Newark 70 110 


8.2.17. The following is a breakdown of what 120 college 
freshmen intend to do next summer. 


Work School Play 


Male 22 14 19 
Female 14 31 20 


8.2.18. An efficiency study was done on the delivery of 
first-class mail originating from the four cities listed in the 
following table. Recorded for each city was the average 
length of time (in days) that it took a letter to reach a 
destination in that same city. Samples were taken on two 
occasions, Sept. 1, 2001 and Sept. 1, 2004. 


City          Sept. 1, 2001    Sept. 1, 2004 
Wooster       1.8              1.7 
Midland       2.0              2.0 
Beaumont      2.2              2.5 
Manchester    1.9              1.7 


8.2.19. Two methods (A and B) are available for remov- 
ing dangerous heavy metals from public water supplies. 
Eight water samples collected from various parts of the 
United States were used to compare the two methods. 
Four were treated with Method A and four were treated 
with Method B. After the processes were completed, each 
sample was rated for purity on a scale of 1 to 100. 


Method A Method B 


88.6 81.4 
92.1 84.6 
90.7 91.4 
93.6 78.6 


8.2.20. Out of 120 senior citizens polled, 65 favored a 
complete overhaul of the health care system while 55 pre- 
ferred more modest changes. When the same choice was 
put to 85 first-time voters, 40 said they were in favor of 
major reform while 45 opted for minor revisions. 


8.2.21. To illustrate the complexity and arbitrariness of 
IRS regulations, a tax-reform lobbying group has sent the 
same five clients to each of two professional tax preparers. 
The following are the estimated tax liabilities quoted by 
each of the preparers. 


Client    Preparer A    Preparer B 
GS        $31,281       $26,850 
MB        14,256        13,958 
AA        26,197        25,520 
DP        8,283         9,107 
SB        47,825        43,192 


8.2.22. The production of a certain organic chemi- 
cal requires ammonium chloride. The manufacturer can 
obtain the ammonium chloride in one of three forms: 
powdered, moderately ground, and coarse. To see if the 
consistency of the NH4Cl is itself a factor that needs to be 
considered, the manufacturer decides to run the reaction 
seven times with each form of ammonium chloride. The 
following are the resulting yields (in pounds). 


                  Moderately 
Powdered NH4Cl    Ground NH4Cl    Coarse NH4Cl 
146 150 141 
152 144 138 
149 148 142 
161 155 146 
158 154 139 
154 148 137 
149 150 145 


8.2.23. An investigation was conducted of 107 fatal poi- 
sonings of children. Each death was caused by one of 
three drugs. In each instance it was determined how the 
child received the fatal overdose. Responsibility for the 
107 accidents was assessed according to the following 
breakdown. 


Drug A Drug B Drug C 


Child Responsible 10 10 18 
Parent Responsible 10 14 10 
Another Person Responsible 4 18 13 


8.2.24. As part of an affirmative-action litigation, records 
were produced showing the average salaries earned by 
white, black, and Hispanic workers in a large manufac- 
turing plant. Three different departments were selected 
at random for the comparison. The entries shown are 
average annual salaries, in thousands of dollars. 


White Black Hispanic 
Department 1 40.2 39.8 39.9 
Department 2 40.6 39.0 39.2 
Department 3 39.7 40.0 38.4 


8.2.25. In Eastern Europe a study was done on fifty peo- 
ple bitten by rabid animals. Twenty victims were given the 
standard Pasteur treatment, while the other thirty were 
given the Pasteur treatment in addition to one or more 
doses of antirabies gamma globulin. Nine of those given 
the standard treatment survived; twenty survived in the 
gamma globulin group. 


8.2.26. To see if any geographical pricing differences 
exist, the cost of a basic-cable TV package was determined 
for a random sample of six cities, three in the southeast 
and three in the northwest. Monthly charges for the south- 
eastern cities were $13.20, $11.55, and $16.75; residents 
in the three northwestern cities paid $14.80, $17.65, and 
$19.20. 


8.2.27. A public relations firm hired by a would-be presi- 
dential candidate has conducted a poll to see whether their 
client faces a gender gap. Out of 800 men interviewed, 
325 strongly supported the candidate, 151 were strongly 
opposed, and 324 were undecided. Among the 750 women 
included in the sample, 258 were strong supporters, 241 
were strong opponents, and 251 were undecided. 


8.2.28. As part of a review of its rate structure, an auto- 
mobile insurance company has compiled the following 
data on claims filed by five male policyholders and five 
female policyholders. 


Client    Claims Filed    Client      Claims Filed 
(male)    in 2004         (female)    in 2004 
MK        $2750           SB          0 
JM        0               ML          0 
AK        0               MS          0 
KT        $1500           BM          $2150 
JT        0               LL          0 


8.2.29. A company claims to have produced a blended 
gasoline that can improve a car’s fuel consumption. They 
decide to compare their product with the leading gas cur- 
rently on the market. Three different cars were used for 
the test: a Porsche, a Buick, and a VW. The Porsche got 
13.6 mpg with the new gas and 12.2 mpg with the “stan- 
dard” gas; the Buick got 18.7 mpg with the new gas and 
18.5 with the standard; the figures for the VW were 34.5 
and 32.6, respectively. 


8.2.30. In a survey conducted by State University’s 
Learning Center, a sample of three freshmen said they 
studied 6, 4, and 10 hours, respectively, over the weekend. 
The same question was posed to three sophomores, who 
reported study times of 4, 5, and 7 hours. For three juniors, 
the responses were 2, 8, and 6 hours. 


8.2.31. A consumer advocacy group, investigating the 
prices of steel-belted radial tires produced by three major 
manufacturers, collects the following data. 


Year    Company A    Company B    Company C 
1995    $62.00       $68.00       $65.00 
2000    $70.00       $72.00       $69.00 
2005    $78.00       $75.00       $75.00 


8.2.32. A small fourth-grade class is randomly split into 
two groups. Each group is taught fractions using a dif- 
ferent method. After three weeks, both groups are given 
the same 100-point test. The scores of students in the first 
group are 82, 86, 91, 72, and 68; the scores reported for the 
second group are 76, 63, 80, 72, and 67. 


8.3 Taking a Second Look at Statistics 
(Samples Are Not “Valid”!) 


Designing an experiment invariably requires that two fundamental issues be 
resolved. First and foremost is the choice of the design itself. Based on the type 
of data available and the objectives to be addressed, what overall “structure” 
should the experiment have? Seven of the most frequently occurring answers to 
that question are the seven models profiled in this chapter, ranging from the 
simplicity of the one-sample design to the complexity of the randomized block 
design. 

As soon as a design has been chosen, a second question immediately follows: 
How large should the sample size (or sample sizes) be? It is precisely that question, 
though, that leads to a very common sampling misconception. There is a widely 
held belief (even by many experienced experimenters, who should know better) 
that some samples are “valid” (presumably because of their size), while others are 
not. Every consulting statistician could probably retire to Hawaii at an early age if 
he or she got a dollar for every time an experimenter posed the following sort of 
question: “I intend to compare Treatment X and Treatment Y using the two-sample 
format. My plan is to take twenty measurements on each of the two treatments. Will 
those be valid samples?” 

The sentiment behind such a question is entirely understandable: the researcher 
is asking whether two samples of size 20 will be “adequate” (in some sense) for 
addressing the objectives of the experiment. Unfortunately, the word “valid” is 
meaningless in this context. There is no such thing as a valid sample because the 
word “valid” has no statistical definition. 

To be sure, we have already learned how to calculate the smallest values of n 
that will achieve certain objectives, typically expressed in terms of the precision of an 
estimator or the power of a hypothesis test. Recall Theorem 5.3.2. To guarantee that 
the estimator $X/n$ for the binomial parameter $p$ has at least a $100(1-\alpha)\%$ chance 
of lying within a distance $d$ of $p$ requires that $n$ be at least as large as $z_{\alpha/2}^2/4d^2$. 

Suppose, for example, that we want a sample size capable of guaranteeing that 
$X/n$ will have an 80% [$= 100(1-\alpha)\%$] chance of being within 0.05 ($= d$) of $p$. By 
Theorem 5.3.2, 

$$n = \frac{(1.28)^2}{4(0.05)^2} = 163.84$$

which rounds up to $n = 164$. 


On the other hand, that sample of n = 164 would not be large enough to guarantee 
that $X/n$ has, say, a 95% chance of being within 0.03 of $p$. To meet these latter 
requirements, n would have to be at least as large as 1068 [$= (1.96)^2/4(0.03)^2$, rounded up]. 
Therein lies the problem. Sample sizes that can satisfy one set of specifica- 
tions will not necessarily be capable of satisfying another. There is no “one size 
fits all” value for n that qualifies a sample as being “adequate” or “sufficient” or 
“valid.” 
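
The dependence of n on the stated objectives is easy to see numerically. The sketch below evaluates the bound from Theorem 5.3.2 for the two sets of specifications just discussed. The function name `min_sample_size` is hypothetical, and the z values are the rounded table values used in the text (1.28 and 1.96), not exact normal quantiles; an exact quantile could shift the result by a unit or so.

```python
import math


def min_sample_size(z, d):
    """Smallest integer n with n >= z^2 / (4 d^2), the bound in Theorem 5.3.2.

    z : table value z_{alpha/2} for the desired 100(1 - alpha)% chance
    d : maximum allowed distance between X/n and p
    """
    return math.ceil(z ** 2 / (4 * d ** 2))


# 80% chance of being within 0.05 of p (z_{.10} = 1.28 from the table)
print(min_sample_size(1.28, 0.05))

# 95% chance of being within 0.03 of p (z_{.025} = 1.96 from the table)
print(min_sample_size(1.96, 0.03))
```

The two calls give different answers for the same estimator, which is the point of the discussion: a sample size is adequate only relative to a specific precision and probability requirement.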

In a broader sense, the phrase “valid sample” is much like the expression 
“statistical tie” discussed in Section 5.3. Both are widely used, and each is a well- 
intentioned attempt to simplify an important statistical concept. Unfortunately, both 
also share the dubious distinction of being mathematical nonsense. 


Chapter 9 


TWO-SAMPLE INFERENCES 


9.1 Introduction 
9.2 Testing $H_0$: $\mu_X = \mu_Y$ 
9.3 Testing $H_0$: $\sigma_X^2 = \sigma_Y^2$ (The F Test) 
9.4 Binomial Data: Testing $H_0$: $p_X = p_Y$ 
9.5 Confidence Intervals for the Two-Sample Problem 
9.6 Taking a Second Look at Statistics (Choosing Samples) 
Appendix 9.A.1 A Derivation of the Two-Sample t Test (A Proof of Theorem 9.2.2) 
Appendix 9.A.2 Minitab Applications 

After earning an Oxford degree in mathematics and chemistry, Gosset began 
working in 1899 for Messrs. Guinness, a Dublin brewery. Fluctuations in materials 
and temperature and the necessarily small-scale experiments inherent in brewing 
convinced him of the necessity for a new, small-sample theory of statistics. Writing 
under the pseudonym “Student,” he published work with the t ratio that was destined 
to become a cornerstone of modern statistical methodology. 

— William Sealy Gosset (“Student”) (1876-1937) 


9.1 Introduction 


The simplicity of the one-sample model makes it the logical starting point for 
any discussion of statistical inference, but it also limits its applicability to the real 
world. Very few experiments involve just a single treatment or a single set of condi- 
tions. On the contrary, researchers almost invariably design experiments to compare 
responses to several treatment levels—or, at the very least, to compare a single 
treatment with a control. 

In this chapter we examine the simplest of these multilevel designs, two-sample 
inferences. Structurally, two-sample inferences always fall into one of two different 
formats: Either two (presumably) different treatment levels are applied to two inde- 
pendent sets of similar subjects or the same treatment is applied to two (presumably) 
different kinds of subjects. Comparing the effectiveness of germicide A relative to 
that of germicide B by measuring the zones of inhibition each one produces in two 
sets of similarly cultured Petri dishes would be an example of the first type. On the 
other hand, examining the bones of sixty-year-old men and sixty-year-old women, all 
lifelong residents of the same city, to see whether both sexes absorb environmental 
strontium-90 at the same rate would be an example of the second type. 

Inference in two-sample problems usually reduces to a comparison of location 
parameters. We might assume, for example, that the population of responses asso- 
ciated with, say, treatment X is normally distributed with mean $\mu_X$ and standard 
deviation $\sigma_X$ while the Y distribution is normal with mean $\mu_Y$ and standard devi- 
ation $\sigma_Y$. Comparing location parameters, then, reduces to testing $H_0$: $\mu_X = \mu_Y$. As 
always, the alternative may be either one-sided, $H_1$: $\mu_X < \mu_Y$ or $H_1$: $\mu_X > \mu_Y$, or two- 
sided, $H_1$: $\mu_X \neq \mu_Y$. (If the data are binomial, the location parameters are $p_X$ and 
$p_Y$, the true “success” probabilities for treatments X and Y, and the null hypothesis 
takes the form $H_0$: $p_X = p_Y$.) 

Sometimes, although much less frequently, it becomes more relevant to com- 
pare the variabilities of two treatments, rather than their locations. A food company, 
for example, trying to decide which of two types of machines to buy for filling cereal 
boxes would naturally be concerned about the average weights of the boxes filled 
by each type, but they would also want to know something about the variabilities 
of the weights. Obviously, a machine that produces high proportions of “underfills” 
and “overfills” would be a distinct liability. In a situation of this sort, the appropriate 
null hypothesis is $H_0$: $\sigma_X^2 = \sigma_Y^2$. 

For comparing the means of two normal populations when $\sigma_X = \sigma_Y$, the standard 
procedure is the two-sample t test. As described in Section 9.2, this is a relatively 
straightforward extension of Chapter 7’s one-sample t test. If $\sigma_X \neq \sigma_Y$, an approxi- 
mate t test is used. For comparing variances, though, it will be necessary to introduce 
a completely new test, this one based on the F distribution of Section 7.3. The 
binomial version of the two-sample problem, testing $H_0$: $p_X = p_Y$, is taken up in 
Section 9.4. 

It was mentioned in connection with one-sample problems that certain infer- 
ences, for various reasons, are more aptly phrased in terms of confidence intervals 
rather than hypothesis tests. The same is true of two-sample problems. In Section 9.5, 
confidence intervals are constructed for the location difference of two populations, 
$\mu_X - \mu_Y$ (or $p_X - p_Y$), and the variability quotient, $\sigma_X^2/\sigma_Y^2$. 


9.2 Testing $H_0$: $\mu_X = \mu_Y$ 


We will suppose that the data for a given experiment consist of two independent 
random samples, $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$, representing either of the models 
referred to in Section 9.1. Furthermore, the two populations from which the X’s and 
Y’s are drawn will be presumed normal. Let $\mu_X$ and $\mu_Y$ denote their means. Our 
objective is to derive a procedure for testing $H_0$: $\mu_X = \mu_Y$. 

As it turns out, the precise form of the test we are looking for depends on the 
variances of the X and Y populations. If it can be assumed that $\sigma_X^2$ and $\sigma_Y^2$ are equal, 
it is a relatively straightforward task to produce the GLRT for $H_0$: $\mu_X = \mu_Y$. (This is, 
in fact, what we will do in Theorem 9.2.2.) But if the variances of the two populations 
are not equal, the problem becomes much more complex. This second case, known 
as the Behrens-Fisher problem, is more than seventy-five years old and remains one 
of the more famous “unsolved” problems in statistics. What headway investigators 
have made has been confined to approximate solutions. These will be discussed later 
in this section. For what follows next, it can be assumed that $\sigma_X^2 = \sigma_Y^2$. 

For the one-sample test of $H_0$: $\mu = \mu_0$, the GLRT was shown to be a function of a spe- 
cial case of the t ratio introduced in Definition 7.3.3 (recall Theorem 7.3.5). We begin 
this section with a theorem that gives still another special case of Definition 7.3.3. 


Theorem 9.2.1 

Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a normal distribution with 
mean $\mu_X$ and standard deviation $\sigma$ and let $Y_1, Y_2, \ldots, Y_m$ be an independent random 
sample of size $m$ from a normal distribution with mean $\mu_Y$ and standard deviation $\sigma$. 
Let $S_X^2$ and $S_Y^2$ be the two corresponding sample variances, and $S_p^2$ the pooled variance, 
where 

$$S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2 + \sum_{i=1}^{m}(Y_i - \bar{Y})^2}{n+m-2}$$

Then 

$$T_{n+m-2} = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{S_p\sqrt{\frac{1}{n} + \frac{1}{m}}}$$

has a Student t distribution with $n + m - 2$ degrees of freedom. 


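Proof The method of proof here is very similar to what was used for Theorem 7.3.5. 
Note that an equivalent formulation of $T_{n+m-2}$ is 

$$T_{n+m-2} = \frac{\dfrac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sigma\sqrt{\frac{1}{n} + \frac{1}{m}}}}{\sqrt{S_p^2/\sigma^2}} = \frac{\dfrac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sigma\sqrt{\frac{1}{n} + \frac{1}{m}}}}{\sqrt{\dfrac{1}{n+m-2}\left[\sum_{i=1}^{n}\left(\dfrac{X_i - \bar{X}}{\sigma}\right)^2 + \sum_{i=1}^{m}\left(\dfrac{Y_i - \bar{Y}}{\sigma}\right)^2\right]}}$$

But $E(\bar{X} - \bar{Y}) = \mu_X - \mu_Y$ and $\mathrm{Var}(\bar{X} - \bar{Y}) = \sigma^2/n + \sigma^2/m$, so the numerator of the 
ratio has a standard normal distribution, $f_Z(z)$. 

In the denominator, 

$$\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{\sigma}\right)^2 = \frac{(n-1)S_X^2}{\sigma^2}$$

and 

$$\sum_{i=1}^{m}\left(\frac{Y_i - \bar{Y}}{\sigma}\right)^2 = \frac{(m-1)S_Y^2}{\sigma^2}$$

are independent $\chi^2$ random variables with $n - 1$ and $m - 1$ df, respectively, so 

$$\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{\sigma}\right)^2 + \sum_{i=1}^{m}\left(\frac{Y_i - \bar{Y}}{\sigma}\right)^2$$

has a $\chi^2$ distribution with $n + m - 2$ df (recall Theorem 7.3.1 and Theorem 4.6.4). 
Also, by Appendix 7.A.2, the numerator and denominator are independent. 
It follows from Definition 7.3.3, then, that 

$$\frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{S_p\sqrt{\frac{1}{n} + \frac{1}{m}}}$$

has a Student t distribution with $n + m - 2$ df. 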


Theorem 9.2.2 

Let $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$ be independent random samples from normal 
distributions with means $\mu_X$ and $\mu_Y$, respectively, and with the same standard 
deviation $\sigma$. Let 

$$t = \frac{\bar{x} - \bar{y}}{s_p\sqrt{\frac{1}{n} + \frac{1}{m}}}$$

a. To test $H_0$: $\mu_X = \mu_Y$ versus $H_1$: $\mu_X > \mu_Y$ at the $\alpha$ level of significance, reject $H_0$ if 
$t \geq t_{\alpha, n+m-2}$. 

b. To test $H_0$: $\mu_X = \mu_Y$ versus $H_1$: $\mu_X < \mu_Y$ at the $\alpha$ level of significance, reject $H_0$ if 
$t \leq -t_{\alpha, n+m-2}$. 

c. To test $H_0$: $\mu_X = \mu_Y$ versus $H_1$: $\mu_X \neq \mu_Y$ at the $\alpha$ level of significance, reject $H_0$ if 
$t$ is either (1) $\leq -t_{\alpha/2, n+m-2}$ or (2) $\geq t_{\alpha/2, n+m-2}$. 

Proof See Appendix 9.A.1. 
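
The three decision rules of Theorem 9.2.2 can be written as one short function. This is a minimal sketch, not part of the text: the name `reject_h0` and the labels "greater", "less", and "two-sided" are hypothetical, and the critical value is assumed to be looked up beforehand in a t table ($t_{\alpha, n+m-2}$ for the one-sided alternatives, $t_{\alpha/2, n+m-2}$ for the two-sided one).

```python
def reject_h0(t, critical_value, alternative):
    """Apply the rejection rules of Theorem 9.2.2.

    t              : observed two-sample t ratio
    critical_value : positive table value, t_{alpha, n+m-2} for a one-sided
                     test or t_{alpha/2, n+m-2} for a two-sided test
    alternative    : "greater" (H1: mu_X > mu_Y), "less" (H1: mu_X < mu_Y),
                     or "two-sided" (H1: mu_X != mu_Y)
    """
    if alternative == "greater":
        return t >= critical_value
    if alternative == "less":
        return t <= -critical_value
    if alternative == "two-sided":
        return abs(t) >= critical_value
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")


# A two-sided test with 16 df at alpha = 0.01 uses t_{.005,16} = 2.9208
print(reject_h0(3.88, 2.9208, "two-sided"))
```

Passing the critical value in as an argument keeps the sketch independent of any particular statistics library; only the comparison logic of parts (a), (b), and (c) is encoded.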


Case Study 9.2.1 


The mystery surrounding the nature of Mark Twain’s participation in the Civil 
War was discussed (but not resolved) in Case Study 1.2.2. Recall that historians 
are still unclear as to whether the creator of Huckleberry Finn and Tom Sawyer 
was a civilian or a combatant in the early 1860s and whether his sympathies lay 
with the North or with the South. 

A tantalizing clue that might shed some light on the matter is a set of ten 
war-related essays written by one Quintus Curtius Snodgrass, who claimed to 
be in the Louisiana militia, although no records documenting his service have 
ever been found. If Snodgrass was just a pen name Twain used, as some suspect, 
then these essays are basically a diary of Twain’s activities during the war, and 
the mystery is solved. If Quintus Curtius Snodgrass was not a pen name, these 
essays are just a red herring, and all questions about Twain’s military activities 
remain unanswered. 

Assessing the likelihood that Twain and Snodgrass were one and the 
same would be the job of a “forensic statistician.” Authors have character- 
istic word-length profiles that effectively serve as verbal fingerprints (much 
like incriminating evidence left at a crime scene). If Authors A and B tend to 
use, say, three-letter words with significantly different frequencies, a reasonable 
inference would be that A and B are different people. 

Table 9.2.1 shows the proportions of three-letter words in each of the ten 
Snodgrass essays and in eight essays known to have been written by Mark 
Twain. If $x_i$ denotes the $i$th Twain proportion, $i = 1, 2, \ldots, 8$, and $y_i$ denotes 
the $i$th Snodgrass proportion, $i = 1, 2, \ldots, 10$, then 

$$\sum_{i=1}^{8} x_i = 1.855 \quad \text{so} \quad \bar{x} = 1.855/8 = 0.2319$$

and 

$$\sum_{i=1}^{10} y_i = 2.097 \quad \text{so} \quad \bar{y} = 2.097/10 = 0.2097$$


Table 9.2.1 Proportion of Three-Letter Words 

Twain                             Proportion    QCS            Proportion 
Sergeant Fathom letter            0.225         Letter I       0.209 
Madame Caprell letter             0.262         Letter II      0.205 
Mark Twain letters in                           Letter III     0.196 
Territorial Enterprise                          Letter IV      0.210 
    First letter                  0.217         Letter V       0.202 
    Second letter                 0.240         Letter VI      0.207 
    Third letter                  0.230         Letter VII     0.224 
    Fourth letter                 0.229         Letter VIII    0.223 
First Innocents Abroad letter                   Letter IX      0.220 
    First half                    0.235         Letter X       0.201 
    Second half                   0.217 


The question to be answered is whether the difference between 0.2319 and 
0.2097 is statistically significant. 

Let $\mu_X$ and $\mu_Y$ denote the true average proportions of three-letter words 
that Twain and Snodgrass, respectively, tended to use. Our objective is to test 

$$H_0: \mu_X = \mu_Y$$

versus 

$$H_1: \mu_X \neq \mu_Y$$

Since 

$$\sum_{i=1}^{8} x_i^2 = 0.4316 \quad \text{and} \quad \sum_{i=1}^{10} y_i^2 = 0.4406$$

the two sample variances are 

$$s_X^2 = \frac{8(0.4316) - (1.855)^2}{8(7)} = 0.0002103$$

and 

$$s_Y^2 = \frac{10(0.4406) - (2.097)^2}{10(9)} = 0.0000955$$

Combined, they give a pooled standard deviation of 0.0121: 

$$s_p = \sqrt{\frac{\sum_{i=1}^{8}(x_i - 0.2319)^2 + \sum_{i=1}^{10}(y_i - 0.2097)^2}{n+m-2}} = \sqrt{\frac{7(0.0002103) + 9(0.0000955)}{8+10-2}} = \sqrt{0.0001457} = 0.0121$$

According to Theorem 9.2.1, if $H_0: \mu_X = \mu_Y$ is true, the sampling distribution of

$$T = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{\frac{1}{8} + \frac{1}{10}}}$$

is described by a Student t curve with 16 (= 8 + 10 - 2) degrees of freedom.

Suppose we let $\alpha = 0.01$. By part (c) of Theorem 9.2.2, $H_0$ should be rejected
in favor of a two-sided $H_1$ if either (1) $t \le -t_{\alpha/2, n+m-2} = -t_{.005,16} = -2.9208$ or
(2) $t \ge t_{\alpha/2, n+m-2} = t_{.005,16} = 2.9208$ (see Figure 9.2.1). But

$$t = \frac{0.2319 - 0.2097}{0.0121 \sqrt{\frac{1}{8} + \frac{1}{10}}} = 3.88$$

Figure 9.2.1 [Student t distribution with 16 df. The rejection regions for $H_0$ are the two tails beyond -2.9208 and 2.9208, each with area 0.005.]


a value falling considerably to the right of $t_{.005,16}$. Therefore, we should reject
$H_0$; it appears that Twain and Snodgrass were not the same person. So, unfortunately,
nothing that Twain did can be inferred from anything that Snodgrass
wrote.
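
The arithmetic of this case study can be retraced with a short script. What follows is only a sketch: the function names `sample_variance` and `pooled_t` are illustrative choices, not part of the text, and the critical value 2.9208 is copied from the case study rather than computed. Because the script works from the unrounded entries of Table 9.2.1, its intermediate values may differ slightly from the printed ones.

```python
import math

# Proportions of three-letter words, as listed in Table 9.2.1.
twain = [0.225, 0.262, 0.217, 0.240, 0.230, 0.229, 0.235, 0.217]
snodgrass = [0.209, 0.205, 0.196, 0.210, 0.202, 0.207, 0.224, 0.223, 0.220, 0.201]


def sample_variance(data):
    """Unbiased sample variance via the computing formula used in the text."""
    n = len(data)
    total = sum(data)
    total_sq = sum(v * v for v in data)
    return (n * total_sq - total ** 2) / (n * (n - 1))


def pooled_t(x, y):
    """Return (t, s_p, df) for the equal-variance two-sample t ratio."""
    n, m = len(x), len(y)
    df = n + m - 2
    s_p = math.sqrt(((n - 1) * sample_variance(x) + (m - 1) * sample_variance(y)) / df)
    t = (sum(x) / n - sum(y) / m) / (s_p * math.sqrt(1 / n + 1 / m))
    return t, s_p, df


t, s_p, df = pooled_t(twain, snodgrass)
T_CRIT = 2.9208  # t_{.005,16}, copied from the case study (not computed here)
print(f"s_p = {s_p:.4f}, t = {t:.2f}, df = {df}")
print("reject H0" if abs(t) >= T_CRIT else "fail to reject H0")
```

Keeping the critical value as a named constant makes it plain that the decision rule, not the t ratio, is where the choice of significance level enters.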


About the Data The $X_i$'s and $Y_i$'s in Table 9.2.1, being proportions, are necessarily
not normally distributed random variables with the same variance, so the basic
conditions of Theorem 9.2.2 are not met. Fortunately, the consequences of violated
assumptions on the probabilistic behavior of $T_{n+m-2}$ are frequently minimal. The
robustness property of the one-sample t ratio that we investigated in Chapter 7 also
holds true for the two-sample t ratio.


Case Study 9.2.2 


Dislike your statistics instructor? Retaliation time will come at the end of the 
semester, when you pepper the student course evaluation form with 1’s. Were 
you pleased? Then send a signal with a load of 5’s. Either way, students’ evalu- 
ations of their instructors do matter. These instruments are commonly used for 
promotion, tenure, and merit raise decisions. 

Studies of student course evaluations show that they do have value. They 
tend to show reliability and consistency. Yet questions remain as to the ability 
of these questionnaires to identify good teachers and courses. 

A veteran instructor of developmental psychology decided to do a study 
(201) on how a single changed factor might affect his students’ course evalua- 
tions. He had attended a workshop extolling the virtue of an enthusiastic style 
in the classroom—more hand gestures, increased voice pitch variability, and the 
like. The vehicle for the study was the large-lecture undergraduate develop- 
mental psychology course he had taught in the fall semester. He set about to 
teach the spring-semester offering in the same way, with the exception of a more 
enthusiastic style. 

The professor fully understood the difficulty of controlling for the many 
variables. He selected the spring class to have the same demographics as the 
one in the fall. He used the same textbook, syllabus, and tests. He listened 
to audiotapes of the fall lectures and reproduced them as closely as possible, 
covering the same topics in the same order. 

The first step in examining the effect of enthusiasm on course evaluations 
is to establish that students have, in fact, perceived an increase in enthusiasm. 
Table 9.2.2 summarizes the ratings the instructor received on the “enthusiasm” 
question for the two semesters. Unless the increase in sample means (2.14 to 
4.21) is statistically significant, there is no point in trying to compare fall and 
spring responses to other questions. 


Table 9.2.2

Fall, $x_i$            Spring, $y_i$
n = 229                m = 243
$\bar{x}$ = 2.14       $\bar{y}$ = 4.21
$s_X$ = 0.94           $s_Y$ = 0.83


Let $\mu_X$ and $\mu_Y$ denote the true means associated with the two different
teaching styles. There is no reason to think that increased enthusiasm on the
part of the instructor would decrease the students' perception of enthusiasm, so
it can be argued here that $H_1$ should be one-sided. That is, we want to test

$$H_0: \mu_X = \mu_Y$$
versus
$$H_1: \mu_X < \mu_Y$$


Let $\alpha = 0.05$.
Since n = 229 and m = 243, the t statistic has 229 + 243 - 2 = 470 degrees of
freedom. Thus, the decision rule calls for the rejection of $H_0$ if

$$t = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\frac{1}{229} + \frac{1}{243}}} \le -t_{\alpha, n+m-2} = -t_{.05,470}$$

A glance at Table A.2 in the Appendix shows that for any value n > 100, $z_\alpha$ is a
good approximation of $t_{\alpha,n}$. That is, $-t_{.05,470} \approx -z_{.05} = -1.64$.
The pooled standard deviation for these data is 0.885:

$$s_p = \sqrt{\frac{228(0.94)^2 + 242(0.83)^2}{229 + 243 - 2}} = 0.885$$

Therefore,

$$t = \frac{2.14 - 4.21}{0.885 \sqrt{\frac{1}{229} + \frac{1}{243}}} = -25.42$$


and our conclusion is a resounding rejection of $H_0$: the increased enthusiasm
was, indeed, noticed.

The real question of interest is whether the change in enthusiasm produced 
a perceived change in some other aspect of teaching that we know did not 
change. For example, the instructor did not become more knowledgeable about 
the material over the course of the two semesters. The student ratings, though, 
disagree. 

Table 9.2.3 shows the instructor’s fall and spring ratings on the “knowledgeable”
question. Is the increase from $\bar{x} = 3.61$ to $\bar{y} = 4.05$ statistically significant?
Yes. For these data, $s_p = 0.898$ and

$$t = \frac{3.61 - 4.05}{0.898 \sqrt{\frac{1}{229} + \frac{1}{243}}} \approx -5.3$$

which falls far to the left of the 0.05 critical value (= -1.64).

What we can glean from these data is both reassuring and a bit disturbing.
Table 9.2.2 appears to confirm the widely held belief that enthusiasm
is an important factor in effective teaching. Table 9.2.3, on the other hand, 
strikes a more cautionary note. It speaks to another widely held belief—that 
student evaluations can sometimes be difficult to interpret. Questions that pur- 
port to be measuring one trait may, in fact, be reflecting something entirely 
different. 


Table 9.2.3

Fall, $x_i$            Spring, $y_i$
n = 229                m = 243
$\bar{x}$ = 3.61       $\bar{y}$ = 4.05
$s_X$ = 0.84           $s_Y$ = 0.95
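
Both comparisons in this case study start from summary statistics rather than raw data, so the pooled t ratio can be computed directly from n, m, the two sample means, and the two sample standard deviations. The sketch below assumes a helper named `pooled_t_from_summary` (a hypothetical name, not from the text) and uses the text's large-sample cutoff of -1.64 in place of $-t_{.05,470}$. Small rounding differences from the printed t values are to be expected.

```python
import math


def pooled_t_from_summary(n, xbar, s_x, m, ybar, s_y):
    """Equal-variance two-sample t ratio computed from summary statistics."""
    df = n + m - 2
    s_p = math.sqrt(((n - 1) * s_x ** 2 + (m - 1) * s_y ** 2) / df)
    t = (xbar - ybar) / (s_p * math.sqrt(1 / n + 1 / m))
    return t, s_p, df


Z_CRIT = -1.64  # -z_{.05}, the large-df stand-in for -t_{.05,470} used in the text

# (n, xbar, s_X, m, ybar, s_Y) as reported in Tables 9.2.2 and 9.2.3.
questions = {
    "enthusiasm": (229, 2.14, 0.94, 243, 4.21, 0.83),
    "knowledgeable": (229, 3.61, 0.84, 243, 4.05, 0.95),
}

results = {}
for name, stats in questions.items():
    t, s_p, df = pooled_t_from_summary(*stats)
    results[name] = (t, s_p, df)
    verdict = "reject H0" if t <= Z_CRIT else "fail to reject H0"
    print(f"{name}: s_p = {s_p:.3f}, t = {t:.2f}, df = {df} -> {verdict}")
```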
About the Data The five-choice responses in student evaluation forms are very
common in survey questionnaires. Such questions are known as Likert items, 
named after the psychologist Rensis Likert. The item typically asks the respon- 
dent to choose his or her level of agreement with a statement, for example, 
“The instructor shows concern for students.” The choices start with “strongly dis- 
agree,” which is scored with a “1,” and go up to a “5” for “strongly agree.” 
The statistic for a given question in a survey is the average value taken over all 
responses. 

Is a t test an appropriate way to analyze data of this sort? Maybe, but the nature
of the responses raises some serious concerns. First of all, the fact that students talk
with each other about their instructors suggests that not all the sample values will
be independent. More importantly, the five-point Likert scale hardly resembles the
normality assumption implicit in a Student t analysis. For many practitioners (but
not all) the robustness of the t test would be enough to justify the analysis described
in Case Study 9.2.2.


The Behrens-Fisher Problem 


Finding a statistic with known density for testing the equality of two means from
normally distributed random samples when the standard deviations of the samples
are not equal is known as the Behrens-Fisher problem. No exact solution is known,
but a widely used approximation is based on the test statistic

$$W = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}$$

where, as usual, $\bar{X}$ and $\bar{Y}$ are the sample means, and $S_X^2$ and $S_Y^2$ are the unbiased
estimators of the variance. B. L. Welch, a faculty member at University College,
London, in a 1938 Biometrika article showed that W is approximately distributed
as a Student t random variable with degrees of freedom given by the nonintuitive
expression

$$\nu = \frac{\left(\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right)^2}{\frac{\sigma_X^4}{n^2(n-1)} + \frac{\sigma_Y^4}{m^2(m-1)}}$$

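To understand Welch’s approximation, it helps to rewrite the random variable W as

$$W = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}
    = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}}
      \div \sqrt{\frac{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}}$$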


In this form, the numerator is a standard normal variable. Suppose there is a chi
square random variable V with $\nu$ degrees of freedom such that the square of the
denominator is equal to $V/\nu$. Then the expression would indeed be a Student t
variable with $\nu$ degrees of freedom. However, in general, the denominator will
not have exactly that distribution. The strategy, then, is to find an approximate
equality for

$$\frac{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}} = \frac{V}{\nu}$$

or, equivalently,

$$\frac{S_X^2}{n} + \frac{S_Y^2}{m} = \left(\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right)\frac{V}{\nu}$$

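At issue is the value of $\nu$. The method of moments (recall Section 5.2) suggests a
solution. If the means and variances of both sides are equated, it can be shown that

$$\nu = \frac{\left(\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right)^2}{\frac{\sigma_X^4}{n^2(n-1)} + \frac{\sigma_Y^4}{m^2(m-1)}}$$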

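Moreover, the expression for $\nu$ depends only on the ratio of the variances, $\theta = \sigma_X^2/\sigma_Y^2$.
To see why, divide the numerator and denominator by $\sigma_Y^4$. Then

$$\nu = \frac{\frac{1}{\sigma_Y^4}\left(\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right)^2}{\frac{1}{n^2(n-1)}\left(\frac{\sigma_X^2}{\sigma_Y^2}\right)^2 + \frac{1}{m^2(m-1)}}
      = \frac{\left(\frac{\theta}{n} + \frac{1}{m}\right)^2}{\frac{\theta^2}{n^2(n-1)} + \frac{1}{m^2(m-1)}}$$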


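and multiplying numerator and denominator by $n^2$ gives the somewhat more
appealing form

$$\nu = \frac{\left(\theta + \frac{n}{m}\right)^2}{\frac{1}{n-1}\theta^2 + \frac{1}{m-1}\left(\frac{n}{m}\right)^2}$$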


Of course, the main application of this theory occurs when $\sigma_X^2$ and $\sigma_Y^2$ are
unknown and $\theta$ must thus be estimated, the obvious choice being $\hat{\theta} = s_X^2/s_Y^2$.

This leads us to the following theorem for testing the equality of means when 
the variances cannot be assumed equal. 


Theorem 9.2.3. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent random samples from normal
distributions with means $\mu_X$ and $\mu_Y$, and standard deviations $\sigma_X$ and $\sigma_Y$, respectively.
Let

$$W = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}$$

Using $\hat{\theta} = s_X^2/s_Y^2$, take $\nu$ to be the expression

$$\frac{\left(\hat{\theta} + \frac{n}{m}\right)^2}{\frac{1}{n-1}\hat{\theta}^2 + \frac{1}{m-1}\left(\frac{n}{m}\right)^2}$$

rounded to the nearest integer. Then W has approximately a Student t distribution with $\nu$ degrees of freedom.

Case Study 9.2.3 


Does size matter? While a successful company’s large number of sales should 
mean bigger profits, does it yield greater profitability? Forbes magazine period- 
ically rates the top two hundred small companies (52), and for each gives the 
profitability as measured by the five-year percentage return on equity. Using 
data from the Forbes article, Table 9.2.4 gives the return on equity for the twelve 
companies with the largest number of sales (ranging from $679 million to $738 
million) and for the twelve companies with the smallest number of sales (rang-
ing from $25 million to $66 million). Based on these data, can we say that the 
return on equity differs between the two types of companies? 


Table 9.2.4

                                 Return on                                       Return on
Large-Sales Companies            Equity (%)   Small-Sales Companies              Equity (%)
Deckers Outdoor                  21           NVE                                21
Jos. A. Bank Clothiers           23           Hi-Shear Technology                21
National Instruments             13           Bovie Medical                      14
Dolby Laboratories               22           Rocky Mountain Chocolate Factory   31
Quest Software                   7            Rochester Medical                  19
Green Mountain Coffee Roasters   17           Anika Therapeutics                 19
Lufkin Industries                19           Nathan’s Famous                    11
Red Hat                          11           Somanetics                         29
Matrix Service                   2            Bolt Technology                    20
DXP Enterprises                  30           Energy Recovery                    27
Franklin Electric                15           Transcend Services                 27
LSB Industries                   43           IEC Electronics                    24


Let $\mu_X$ and $\mu_Y$ be the respective average returns on equity. The indicated
test of hypotheses is

$$H_0: \mu_X = \mu_Y$$
versus
$$H_1: \mu_X \neq \mu_Y$$

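For the data in the table, $\bar{x} = 18.6$, $\bar{y} = 21.9$, $s_X^2 = 115.9929$, and $s_Y^2 = 35.7604$. The
test statistic is

$$w = \frac{\bar{x} - \bar{y} - (\mu_X - \mu_Y)}{\sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}} = \frac{18.6 - 21.9}{\sqrt{\frac{115.9929}{12} + \frac{35.7604}{12}}} = -0.928$$

Also,

$$\hat{\theta} = \frac{s_X^2}{s_Y^2} = \frac{115.9929}{35.7604} = 3.244$$

so

$$\nu = \frac{\left(3.244 + \frac{12}{12}\right)^2}{\frac{1}{11}(3.244)^2 + \frac{1}{11}\left(\frac{12}{12}\right)^2} \approx 17.2$$

which implies that $\nu = 17$.
We should reject $H_0$ at the $\alpha = 0.05$ level of significance if $w \ge t_{.025,17} =
2.1098$ or $w \le -t_{.025,17} = -2.1098$. Here, w = -0.928 falls in between the two
critical values, so the difference between $\bar{x}$ and $\bar{y}$ is not statistically significant.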
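
The steps of Theorem 9.2.3 can also be carried out directly on the raw returns in Table 9.2.4. The following is a minimal sketch, with `welch` as a hypothetical helper name and the critical value 2.1098 copied from the case study. Because it starts from the unrounded data rather than the rounded summaries quoted above, the computed w may differ from -0.928 in the second decimal place; the rounded degrees of freedom and the conclusion are unaffected.

```python
import math

# Five-year returns on equity (%), as listed in Table 9.2.4.
large = [21, 23, 13, 22, 7, 17, 19, 11, 2, 30, 15, 43]
small = [21, 21, 14, 31, 19, 19, 11, 29, 20, 27, 27, 24]


def mean(data):
    return sum(data) / len(data)


def sample_variance(data):
    """Unbiased sample variance (divisor n - 1)."""
    center = mean(data)
    return sum((v - center) ** 2 for v in data) / (len(data) - 1)


def welch(x, y):
    """Return (w, nu) for the unequal-variance approximate t test."""
    n, m = len(x), len(y)
    vx, vy = sample_variance(x), sample_variance(y)
    w = (mean(x) - mean(y)) / math.sqrt(vx / n + vy / m)
    theta_hat = vx / vy
    nu = (theta_hat + n / m) ** 2 / (theta_hat ** 2 / (n - 1) + (n / m) ** 2 / (m - 1))
    return w, round(nu)


w, nu = welch(large, small)
T_CRIT = 2.1098  # t_{.025,17}, copied from the case study (not computed here)
print(f"w = {w:.3f}, nu = {nu}")
print("reject H0" if abs(w) >= T_CRIT else "fail to reject H0")
```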
Comment It occasionally happens that an experimenter wants to test $H_0: \mu_X = \mu_Y$
and knows the values of $\sigma_X^2$ and $\sigma_Y^2$. For those situations, the t test of Theorem 9.2.2
is inappropriate. If the n $X_i$’s and m $Y_i$’s are normally distributed, it follows from the
corollary to Theorem 4.3.3 that

$$Z = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}} \qquad (9.2.1)$$

has a standard normal distribution. Any such test of $H_0: \mu_X = \mu_Y$, then, should be
based on an observed Z ratio rather than an observed t ratio.
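
A brief sketch of how an observed Z ratio and its two-sided P-value might be computed when both standard deviations are known. Everything here is illustrative: the function names are hypothetical, the input numbers are invented, and the standard normal cdf is obtained from the error function in Python's standard library rather than from a table.

```python
import math


def normal_cdf(z):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_sample_z(xbar, ybar, sigma_x, sigma_y, n, m):
    """Observed Z ratio of Equation 9.2.1 under H0: mu_X = mu_Y."""
    return (xbar - ybar) / math.sqrt(sigma_x ** 2 / n + sigma_y ** 2 / m)


def two_sided_p_value(z):
    """Probability of a Z ratio at least as extreme as z, in either tail."""
    return 2.0 * (1.0 - normal_cdf(abs(z)))


# Hypothetical inputs, invented for illustration only.
z = two_sample_z(xbar=52.0, ybar=48.5, sigma_x=6.0, sigma_y=8.0, n=25, m=40)
p = two_sided_p_value(z)
print(f"z = {z:.3f}, two-sided P-value = {p:.4f}")
```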

If the degrees of freedom for a t test exceed 100, then the test statistic of Equation
9.2.1 is used, but it is treated as a Z ratio. In either the test of Theorem 9.2.2
or 9.2.3, if the degrees of freedom exceed 100, the statistic of Theorem 9.2.3 is used
with the z tables.


Questions 


9.2.1. Some states that operate a lottery believe that 
restricting the use of lottery profits to supporting edu- 
cation makes the lottery more profitable. Other states 
permit general use of the lottery income. The profitabil- 
ity of the lottery for a group of states in each category is 
given below. 


State Lottery Profits

For Education                   For General Use
State             % Profit      State             % Profit
New Mexico        24            Massachusetts     21
Idaho             25            Maine             22
Kentucky          28            Iowa              24
South Carolina    28            Colorado          27
Georgia           28            Indiana           27
Missouri          29            Dist. Columbia    28
Ohio              29            Connecticut       29
Tennessee         31            Pennsylvania      32
Florida           31            Maryland          32
California        35
North Carolina    35
New Jersey        35


Source: New York Times, National Section, October 7, 2007, p. 14. 


Test at the $\alpha = 0.01$ level whether the mean profit of states
using the lottery for education is higher than that of states 
permitting general use. Assume that the variances of the 
two random variables are equal. 


9.2.2. As the United States has struggled with the grow- 
ing obesity of its citizens, diets have become big business. 
Among the many competing regimens for those seeking 
weight reduction are the Atkins and Zone diets. In a com- 
parison of these two diets for one-year weight loss, a study 
(59) found that seventy-seven subjects on the Atkins diet 
had an average weight loss of $\bar{x} = -4.7$ kg and a sample
standard deviation of $s_X = 7.05$ kg. Similar figures for the
seventy-nine people on the Zone diet were $\bar{y} = -1.6$ kg
and $s_Y = 5.36$ kg. Is the greater reduction with the Atkins
diet statistically significant? Test for $\alpha = 0.05$.


9.2.3. A medical researcher believes that women typi- 
cally have lower serum cholesterol than men. To test this 
hypothesis, he took a sample of 476 men between the ages 
of nineteen and forty-four and found their mean serum 
cholesterol to be 189.0 mg/dl with a sample standard devi- 
ation of 34.2. A group of 592 women in the same age range 
averaged 177.2 mg/dl and had a sample standard deviation 
of 33.3. Is the lower average for the women statistically 
significant? Set $\alpha = 0.05$.


9.2.4. In the academic year 2004-05, 1126 high school 
freshmen took the SAT Reasoning Test. On the Criti- 
cal Reasoning portion, this group had a mean score of 
491 with a standard deviation of 119. The following year, 
5042 sophomores (none of them in the 2004-05 freshmen 
group) scored an average of 498, with a standard deviation 
of 129. Is the higher average score for the sophomores a 
result of such factors as additional schooling and increased 
maturity or simply a random effect? Test at the $\alpha = 0.05$
level of significance. 


Source: College Board SAT, Total Group Profile Report, 
2008. 


9.2.5. The University of Missouri-St. Louis gave a vali- 
dation test to entering students who had taken calculus in 
high school. The group of ninety-three students receiving 
no college credit had a mean score of 4.17 on the vali- 
dation test with a sample standard deviation of 3.70. For 
the twenty-eight students who received credit from a high 
school dual-enrollment class, the mean score was 4.61 with 
a sample standard deviation of 4.28. Is there a significant 
difference in these means at the $\alpha = 0.01$ level?


Source: MAA Focus, December 2008, p. 19. 


9.2.6. Ring Lardner was one of this country’s most pop- 
ular writers during the 1920s and 1930s. He was also a 


chronic alcoholic who died prematurely at the age of forty-eight.
The following table lists the life spans of some of
Lardner’s contemporaries (36). Those in the sample on the
left were all problem drinkers; they died, on the average,
at age sixty-five. The twelve (sober) writers on the right
tended to live a full ten years longer. Can it be argued that
an increase of that magnitude is statistically significant?
Test an appropriate null hypothesis against a one-sided
$H_1$. Use the 0.05 level of significance. (Note: The pooled
sample standard deviation for these two samples is 13.9.)


Authors Noted for Alcohol Abuse        Authors Not Noted for Alcohol Abuse
Name                Age at Death       Name                    Age at Death
Ring Lardner        48                 Carl Van Doren          65
Sinclair Lewis      66                 Ezra Pound              87
Raymond Chandler    71                 Randolph Bourne         32
Eugene O’Neill      65                 Van Wyck Brooks         77
Robert Benchley     56                 Samuel Eliot Morrison   89
J.P. Marquand       67                 John Crowe Ransom       86
Dashiell Hammett    67                 T.S. Eliot              77
e.e. cummings       70                 Conrad Aiken            84
Edmund Wilson       77                 Ben Ames Williams       64
Average:            65.2               Henry Miller            88
                                       Archibald MacLeish      90
                                       James Thurber           67
                                       Average:                75.5


9.2.7. Poverty Point is the name given to a num- 
ber of widely scattered archaeological sites throughout 
Louisiana, Mississippi, and Arkansas. These are the 
remains of a society thought to have flourished during the 
period from 1700 to 500 B.C. Among their characteristic
artifacts are ornaments that were fashioned out of clay and 
then baked. The following table shows the dates (in years 
B.C.) associated with four of these baked clay ornaments 
found in two different Poverty Point sites, Terral Lewis 
and Jaketown (86). The averages for the two samples are 
1133.0 and 1013.5, respectively. Is it believable that these 
two settlements developed the technology to manufacture 
baked clay ornaments at the same time? Set up and test an
appropriate $H_0$ against a two-sided $H_1$ at the $\alpha = 0.05$ level
of significance. For these data $s_X = 266.9$ and $s_Y = 224.3$.


Terral Lewis Estimates, $x_i$     Jaketown Estimates, $y_i$
1492                              1346
1169                              942
883                               908
988                               858


9.2.8. A major source of “mercury poisoning” comes
from the ingestion of methylmercury ($\mathrm{CH}_3^{203}$), which is
found in contaminated fish (recall Question 5.3.3). Among
the questions pursued by medical investigators trying to
understand the nature of this particular health problem
is whether methylmercury is equally hazardous to men
and women. The following (114) are the half-lives of
methylmercury in the systems of six women and nine men
who volunteered for a study where each subject was given
an oral administration of $\mathrm{CH}_3^{203}$. Is there evidence here
that women metabolize methylmercury at a different rate
than men do? Do an appropriate two-sample t test at the
$\alpha = 0.01$ level of significance. The two sample standard
deviations for these data are $s_X = 15.1$ and $s_Y = 8.1$.


Methylmercury ($\mathrm{CH}_3^{203}$) Half-Lives (in Days)

Females, $x_i$     Males, $y_i$
52                 72
69                 88
73                 87
88                 74
87                 78
56                 70
                   78
                   93
                   74

9.2.9. Lipton, a company primarily known for tea, con- 
sidered using coupons to stimulate sales of its packaged 
dinner entrees. The company was particularly interested 
whether there was a difference in the effect of coupons on
singles versus married couples. A poll of consumers asked 
them to respond to the question “Do you use coupons 
regularly?” by a numerical scale, where 1 stands for agree 
strongly, 2 for agree, 3 for neutral, 4 for disagree, and 5 for 
disagree strongly. The results of the poll are given in the 
following table (19). 


Use Coupons Regularly

Single (X)            Married (Y)
n = 31                n = 57
$\bar{x}$ = 3.10      $\bar{y}$ = 2.43
$s_X$ = 1.469         $s_Y$ = 1.350


Is the observed difference significant at the $\alpha = 0.05$ level?


9.2.10. A company markets two brands of latex paint— 
regular and a more expensive brand that claims to dry 
an hour faster. A consumer magazine decides to test this 
claim by painting ten panels with each product. The aver- 
age drying time of the regular brand is 2.1 hours with a 
sample standard deviation of 12 minutes. The fast-drying 
version has an average of 1.6 hours with a sample stan- 
dard deviation of 16 minutes. Test the null hypothesis that 
the more expensive brand dries an hour quicker. Use a
one-sided $H_1$. Let $\alpha = 0.05$.


9.2.11. (a) Suppose $H_0: \mu_X = \mu_Y$ is to be tested against
$H_1: \mu_X \neq \mu_Y$. The two sample sizes are 6 and 11. If $s_p$ =
15.3, what is the smallest value for $|\bar{x} - \bar{y}|$ that will result
in $H_0$ being rejected at the $\alpha = 0.01$ level of significance?
(b) What is the smallest value for $\bar{x} - \bar{y}$ that will lead to
the rejection of $H_0: \mu_X = \mu_Y$ in favor of $H_1: \mu_X > \mu_Y$ if
$\alpha = 0.05$, $s_p = 214.9$, n = 13, and m = 8?


9.2.12. Suppose that $H_0: \mu_X = \mu_Y$ is being tested against
$H_1: \mu_X \neq \mu_Y$, where $\sigma_X^2$ and $\sigma_Y^2$ are known to be 17.6 and
22.9, respectively. If n = 10, m = 20, $\bar{x} = 81.6$, and $\bar{y} = 79.9$,
what P-value would be associated with the observed Z
ratio?


9.2.13. An executive has two routes that she can take 
to and from work each day. The first is by interstate; the 
second requires driving through town. On the average it 
takes her 33 minutes to get to work by the interstate and 
35 minutes by going through town. The standard devia- 
tions for the two routes are 6 and 5 minutes, respectively. 
Assume the distributions of the times for the two routes 
are approximately normally distributed. 


(a) What is the probability that on a given day, driving 
through town would be the quicker of her choices? 

(b) What is the probability that driving through town 
for an entire week (ten trips) would yield a lower 
average time than taking the interstate for the entire 
week? 


9.2.14. Prove that the Z ratio given in Equation 9.2.1 has 
a standard normal distribution. 


9.2.15. If $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ are independent
random samples from normal distributions with the
same $\sigma^2$, prove that their pooled sample variance, $S_p^2$, is an
unbiased estimator for $\sigma^2$.


9.2.16. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent
random samples drawn from normal distributions
with means $\mu_X$ and $\mu_Y$, respectively, and with the same
known variance $\sigma^2$. Use the generalized likelihood ratio
criterion to derive a test procedure for choosing between
$H_0: \mu_X = \mu_Y$ and $H_1: \mu_X \neq \mu_Y$.


9.2.17. A person exposed to an infectious agent, either by 
contact or by vaccination, normally develops antibodies 
to that agent. Presumably, the severity of an infection 
is related to the number of antibodies produced. The 
degree of antibody response is indicated by saying that 
the person’s blood serum has a certain titer, with higher 
titers indicating greater concentrations of antibodies. The 
following table gives the titers of twenty-two persons 
involved in a tularemia epidemic in Vermont (18). Eleven 
were quite ill; the other eleven were asymptomatic. Use an 
approximate t ratio to test $H_0: \mu_X = \mu_Y$ against a one-sided
$H_1$ at the 0.05 level of significance.

The sample standard deviations for the “Severely Ill”
and “Asymptomatic” groups are 428 and 183, respectively.


Severely Ill           Asymptomatic
Subject   Titer        Subject   Titer
1         640          12        10
2         80           13        320
3         1280         14        320
4         160          15        320
5         640          16        320
6         640          17        80
7         1280         18        160
8         640          19        10
9         160          20        640
10        320          21        160
11        160          22        320

9.2.18. For the approximate two-sample t test described
in Question 9.2.17, it will be true that

$$\nu < n + m - 2$$

Why is that a disadvantage for the approximate test? That
is, why is it better to use the Theorem 9.2.1 version of the
t test if, in fact, $\sigma_X^2 = \sigma_Y^2$?


9.2.19. The two-sample data described in Question 8.2.2
would be analyzed by testing $H_0: \mu_X = \mu_Y$, where $\mu_X$ and
$\mu_Y$ denote the true average motorcycle-related fatality
rates for states having “limited” and “comprehensive”
helmet laws, respectively.


(a) Should the t test for $H_0: \mu_X = \mu_Y$ follow the format
of Theorem 9.2.2 or the approximation given in
Theorem 9.2.3? Explain.

(b) Is there anything unusual about these data? Explain. 


9.2.20. Some financial analysts believe that the election 
of a Republican president is good for the stock market. 
To test this claim, one study (155) recorded the ten-year 
growth in Standard & Poor’s index following each elec- 
tion of a new president. The results are given in the table 
below. 


Democrats                       Republicans
Winner          S&P Growth      Winner           S&P Growth
Roosevelt ’36   22.4            Eisenhower ’52   45.7
Roosevelt ’40   24.0            Eisenhower ’56   28.6
Roosevelt ’44   38.0            Nixon ’68        14.2
Truman ’48      45.7            Nixon ’72        18.8
Kennedy ’60     21.2            Reagan ’80       50.3
Johnson ’64     17.9            Reagan ’84       40.1
Carter ’76      38.2            Bush ’88         52.4
Clinton ’92     33.7
Clinton ’96     23.8

Is the higher average for the Republicans statistically 
significant? Test at the 0.01 level. Do not assume the 
variances are equal. 


9.3 Testing $H_0: \sigma_X^2 = \sigma_Y^2$ (The F Test)


Although by far the majority of two-sample problems are set up to detect possible
shifts in location parameters, situations sometimes arise where it is equally
important (perhaps even more important) to compare variability parameters. Two
machines on an assembly line, for example, may be producing items whose average
dimensions ($\mu_X$ and $\mu_Y$) of some sort, say, thickness, are not significantly different
but whose variabilities (as measured by $\sigma_X^2$ and $\sigma_Y^2$) are. This becomes a critical piece
of information if the increased variability results in an unacceptable proportion of
items from one of the machines falling outside the engineering specifications (see
Figure 9.3.1).


Figure 9.3.1 Variability of machine outputs. [Two output distributions are shown against the same engineering specifications: for machine X, an acceptable proportion of items is too thin or too thick; for machine Y, an unacceptable proportion is too thin or too thick.]


In this section we will examine the generalized likelihood ratio test of $H_0: \sigma_X^2 = \sigma_Y^2$
versus $H_1: \sigma_X^2 \neq \sigma_Y^2$. The data will consist of two independent random samples
of sizes n and m: The first, $x_1, x_2, \ldots, x_n$, is assumed to have come from a
normal distribution having mean $\mu_X$ and variance $\sigma_X^2$; the second, $y_1, y_2, \ldots, y_m$,
from a normal distribution having mean $\mu_Y$ and variance $\sigma_Y^2$. (All four parameters
are assumed to be unknown.) Theorem 9.3.1 gives the test procedure that
will be used. The proof will not be given, but it follows the same basic pattern
we have seen in other GLRTs; the important step is showing that the
likelihood ratio is a monotonic function of the F random variable described in
Definition 7.3.2.


Comment Tests of $H_0: \sigma_X^2 = \sigma_Y^2$ arise in another, more routine context. Recall that
the procedure for testing the equality of $\mu_X$ and $\mu_Y$ depends on whether or not the
two population variances are equal. This implies that a test of $H_0: \sigma_X^2 = \sigma_Y^2$ should
precede every test of $H_0: \mu_X = \mu_Y$. If the former is accepted, the t test on $\mu_X$ and $\mu_Y$ is
done according to Theorem 9.2.2; but if $H_0: \sigma_X^2 = \sigma_Y^2$ is rejected, Theorem 9.2.2 is not
entirely appropriate. A frequently used alternative in that case is the approximate t
test described in Theorem 9.2.3.


Theorem 9.3.1. Let $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$ be independent random samples from normal
distributions with means $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$, respectively.
a. To test $H_0: \sigma_X^2 = \sigma_Y^2$ versus $H_1: \sigma_X^2 > \sigma_Y^2$ at the $\alpha$ level of significance, reject $H_0$ if
$s_Y^2/s_X^2 \le F_{\alpha, m-1, n-1}$.
b. To test $H_0: \sigma_X^2 = \sigma_Y^2$ versus $H_1: \sigma_X^2 < \sigma_Y^2$ at the $\alpha$ level of significance, reject $H_0$ if
$s_Y^2/s_X^2 \ge F_{1-\alpha, m-1, n-1}$.
c. To test $H_0: \sigma_X^2 = \sigma_Y^2$ versus $H_1: \sigma_X^2 \neq \sigma_Y^2$ at the $\alpha$ level of significance, reject $H_0$ if
$s_Y^2/s_X^2$ is either (1) $\le F_{\alpha/2, m-1, n-1}$ or (2) $\ge F_{1-\alpha/2, m-1, n-1}$.


Comment The GLRT described in Theorem 9.3.1 is approximate for the same sort
of reason the GLRT for $H_0: \sigma^2 = \sigma_0^2$ is approximate (see Theorem 7.5.2). The distribution
of the test statistic, $S_Y^2/S_X^2$, is not symmetric, and the two ranges of variance
ratios yielding $\lambda$’s less than or equal to $\lambda^*$ (i.e., the left tail and right tail of the
critical region) have slightly different areas. For the sake of convenience, though,
it is customary to choose the two critical values so that each cuts off the same
area, $\alpha/2$.


Case Study 9.3.1 


Electroencephalograms are records showing fluctuations of electrical activity 
in the brain. Among the several different kinds of brain waves produced, the 
dominant ones are usually alpha waves. These have a characteristic frequency 
of anywhere from eight to thirteen cycles per second. 

The objective of the experiment described in this example was to see 
whether sensory deprivation over an extended period of time has any effect on 
the alpha-wave pattern. The subjects were twenty inmates in a Canadian prison 
who were randomly split into two equal-sized groups. Members of one group 
were placed in solitary confinement; those in the other group were allowed 
to remain in their own cells. Seven days later, alpha-wave frequencies were 
measured for all twenty subjects (60), as shown in Table 9.3.1. 


Table 9.3.1 Alpha-Wave Frequencies (CPS)

Nonconfined, $x_i$     Solitary Confinement, $y_i$
10.7                   9.6
10.7                   10.4
10.4                   9.7
10.9                   10.3
10.5                   9.2
10.3                   9.3
9.6                    9.9
11.1                   9.5
11.2                   9.0
10.4                   10.9

Judging from Figure 9.3.2, there was an apparent decrease in alpha-wave
frequency for persons in solitary confinement. There also appears to have been
an increase in the variability for that group. We will use the F test to determine
whether the observed difference in variability ($s_X^2 = 0.21$ versus $s_Y^2 = 0.36$) is
statistically significant.

Figure 9.3.2 Alpha-wave frequencies (cps).


Let $\sigma_X^2$ and $\sigma_Y^2$ denote the true variances of alpha-wave frequencies for
nonconfined and solitary-confined prisoners, respectively. The hypotheses to be
tested are

$$H_0: \sigma_X^2 = \sigma_Y^2$$
versus
$$H_1: \sigma_X^2 \neq \sigma_Y^2$$

Let $\alpha = 0.05$ be the level of significance. Given that

$$\sum_{i=1}^{10} x_i = 105.8 \qquad \sum_{i=1}^{10} x_i^2 = 1121.26$$

$$\sum_{i=1}^{10} y_i = 97.8 \qquad \sum_{i=1}^{10} y_i^2 = 959.70$$

the sample variances become

$$s_X^2 = \frac{10(1121.26) - (105.8)^2}{10(9)} = 0.21$$

and

$$s_Y^2 = \frac{10(959.70) - (97.8)^2}{10(9)} = 0.36$$

Dividing the sample variances gives an observed F ratio of 1.71:

$$F = \frac{s_Y^2}{s_X^2} = \frac{0.36}{0.21} = 1.71$$


Both n and m are ten, so we would expect $S_Y^2/S_X^2$ to behave like an F random
variable with nine and nine degrees of freedom (assuming $H_0: \sigma_X^2 = \sigma_Y^2$ is
true). From Table A.4 in the Appendix, we see that the values cutting off areas
of 0.025 in either tail of that distribution are 0.248 and 4.03 (see Figure 9.3.3).

Since the observed F ratio falls between the two critical values, our decision
is to fail to reject $H_0$; a ratio of sample variances equal to 1.71 does not rule out
the possibility that the two true variances are equal. (In light of the Comment
preceding Theorem 9.3.1, it would now be appropriate to test $H_0: \mu_X = \mu_Y$ using
the two-sample t test described in Section 9.2.)


Figure 9.3.3 Distribution of $S_Y^2/S_X^2$ when $H_0$ is true. [F distribution with 9 and 9 degrees of freedom; the rejection regions lie below 0.248 and above 4.03, each with area 0.025.]


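The F test of this case study can be retraced from the raw data in Table 9.3.1. The script below is a sketch under stated assumptions: `sample_variance` is an illustrative helper name, and the two critical values are copied from the case study (Table A.4) rather than computed, since the standard library has no F distribution. Working from unrounded variances gives a ratio slightly different from the 1.71 obtained with the rounded values 0.36 and 0.21, but the decision is the same.

```python
# Alpha-wave frequencies (cps), as listed in Table 9.3.1.
nonconfined = [10.7, 10.7, 10.4, 10.9, 10.5, 10.3, 9.6, 11.1, 11.2, 10.4]
solitary = [9.6, 10.4, 9.7, 10.3, 9.2, 9.3, 9.9, 9.5, 9.0, 10.9]


def sample_variance(data):
    """Unbiased sample variance (divisor n - 1)."""
    n = len(data)
    center = sum(data) / n
    return sum((v - center) ** 2 for v in data) / (n - 1)


# Critical values for 9 and 9 df, copied from the case study (Table A.4).
F_LOWER, F_UPPER = 0.248, 4.03

s2_x = sample_variance(nonconfined)
s2_y = sample_variance(solitary)
f_ratio = s2_y / s2_x  # Theorem 9.3.1 puts the Y variance in the numerator

reject = f_ratio <= F_LOWER or f_ratio >= F_UPPER
print(f"s_X^2 = {s2_x:.2f}, s_Y^2 = {s2_y:.2f}, F = {f_ratio:.2f}")
print("reject H0" if reject else "fail to reject H0")

# With equal degrees of freedom the lower cutoff is the reciprocal of the upper one.
print(f"1 / F_UPPER = {1 / F_UPPER:.3f}")
```

The final line reflects a general property of the F distribution: the lower-tail cutoff with (m - 1, n - 1) degrees of freedom is the reciprocal of the upper-tail cutoff with (n - 1, m - 1) degrees of freedom, which here coincide because n = m.
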
Questions 


9.3.1. Case Study 9.2.3 was offered as an example of test- 
ing means when the variances are not assumed equal. Was 
this a correct assumption about the variances? Test at the 
0.05 level of significance. 


9.3.2. Two popular forms of mortgage are the thirty-year 
fixed-rate mortgage, where the borrower has thirty years 
to repay the loan at a constant rate, and the adjustable- 
rate mortgage (ARM), one version of which is for five 
years with the possibility of yearly changes in the inter- 
est rate. Since the ARM offers less certainty, its rates are 
usually lower than those of fixed-rate mortgages. How- 
ever, such vehicles should show more variability in rates. 
Test this hypothesis at the 0.10 level of significance using 
the following samples of mortgage offerings for a loan 
of $160,000 (the borrower needs $200,000, but must pay 
$40,000 up front). 


$160,000 Mortgage Rates 


30-Year Fixed ARM
5.500 3.875 
5.500 5.125 
5.250 5.000 
5.125 4.750 
5.875 4.375 
5.625 
5.250 
4.875 


9.3.3. Among the standard personality inventories used 
by psychologists is the thematic apperception test (TAT) 


in which a subject is shown a series of pictures and is asked 
to make up a story about each one. Interpreted properly, 
the content of the stories can provide valuable insights 
into the subject’s mental well-being. The following data 
show the TAT results for 40 women, 20 of whom were 
the mothers of normal children and 20 the mothers of 
schizophrenic children. In each case the subject was shown 
the same set of 10 pictures. The figures recorded were 
the numbers of stories (out of 10) that revealed a posi- 
tive parent-child relationship, one where the mother was 
clearly capable of interacting with her child in a flexible, 
open-minded way (199). 


TAT Scores 


Mothers of Normal Mothers of Schizophrenic 


Children Children 
8 4 6 5 1 2 ot 1 3 2 
4 4 6 4 2 A 2 1 3 1 
2 1 1 4 3 0 2 4 2 3 
3 2 6 3 4 3 0 1 2 2 
(a) Test $H_0: \sigma_X^2 = \sigma_Y^2$ versus $H_1: \sigma_X^2 \neq \sigma_Y^2$, where $\sigma_X^2$ and
$\sigma_Y^2$ are the variances of the scores of mothers of normal
children and scores of mothers of schizophrenic
children, respectively. Let $\alpha = 0.05$.
(b) If $H_0: \sigma_X^2 = \sigma_Y^2$ is accepted in part (a), test $H_0: \mu_X = \mu_Y$
versus $H_1: \mu_X \neq \mu_Y$. Set $\alpha$ equal to 0.05.


9.3.4. In a study designed to investigate the effects of a 
strong magnetic field on the early development of mice 


(7), 10 cages, each containing three 30-day-old albino 
female mice, were subjected for a period of 12 days to 
a magnetic field having an average strength of 80 Oe/cm. 
Thirty other mice, housed in 10 similar cages, were not put 
in the magnetic field and served as controls. Listed in the 
table are the weight gains, in grams, for each of the 20 sets 
of mice. 


In Magnetic Field Not in Magnetic Field 


Cage Weight Gain(g) Cage Weight Gain (g) 


1 22.8 11 23.5 
2 10.2 12 31.0 
3 20.8 13 19.5 
4 27.0 14 26.2 
5 19.2 15 26.5 
6 9.0 16 25.2 
7 14.2 17 24.5 
8 19.8 18 23.8 
9 14.5 19 27.8 
10 14.8 20 22.0 


Test whether the variances of the two sets of weight gains
are significantly different. Let $\alpha = 0.05$. For the mice in the
magnetic field, $s_X = 5.67$; for the other mice, $s_Y = 3.18$.


9.3.5. Raynaud’s syndrome is characterized by the sud- 
den impairment of blood circulation in the fingers, a 
condition that results in discoloration and heat loss. The 
magnitude of the problem is evidenced in the following 
data, where twenty subjects (ten “normals” and ten with 
Raynaud’s syndrome) immersed their right forefingers in 
water kept at 19°C. The heat output (in cal/cm²/minute) of
the forefinger was then measured with a calorimeter (105). 


Normal Subjects                      Subjects with Raynaud’s Syndrome
Patient   Heat Output (cal/cm²/min)  Patient   Heat Output (cal/cm²/min)
W.K       2.43                       R.A.      0.81
MN.       1.83                       R.M.      0.70
S.A.      2.43                       FM.       0.74
Z.K.      2.70                       K.A.      0.36
JH.       1.88                       H.M.      0.75
J.G.      1.96                       S.M.      0.56
G.K       1.53                       R.M.      0.65
AS.       2.08                       G.E.      0.87
TE.       1.85                       B.W.      0.40
LE        2.44                       NE        0.31
$\bar{x} = 2.11$                     $\bar{y} = 0.62$
$s_X = 0.37$                         $s_Y = 0.20$

Test that the heat-output variances for normal subjects
and those with Raynaud’s syndrome are the
same. Use a two-sided alternative and the 0.05 level of
significance.


9.3.6. The bitter, eight-month baseball strike that ended 
the 1994 season so abruptly was expected to have sub- 
stantial repercussions at the box office when the 1995 
season finally got under way. It did. By the end of the 
first week of play, American League teams were play- 
ing to 12.8% fewer fans than the year before; National 
League teams fared even worse—their attendance was 
down 15.1% (190). Based on the team-by-team atten- 
dance figures given below, would it be appropriate to use 
the pooled two-sample t test of Theorem 9.2.2 to assess 
the statistical significance of the difference between those 
two means? 


American League                   National League
Team          Change              Team            Change
Baltimore     -2%                 Atlanta         -49%
Boston        +16                 Chicago         -4
California    +7                  Cincinnati      -18
Chicago       -27                 Colorado        -27
Cleveland     No home games       Florida         -15
Detroit       -22                 Houston         -16
Kansas City   -20                 Los Angeles     -10
Milwaukee     -30                 Montreal        -1
Minnesota     -8                  New York        +34
New York      -2                  Philadelphia    -9
Oakland       No home games       Pittsburgh      -28
Seattle       -3                  San Diego       -10
Texas         -39                 San Francisco   -45
Toronto       -24                 St. Louis       -14
Average:      -12.8%              Average:        -15.1%


9.3.7. For the data in Question 9.2.8, the sample variances
for the methylmercury half-lives are 227.77 for the females
and 65.25 for the males. Does the magnitude of that difference
invalidate using Theorem 9.2.2 to test $H_0: \mu_X = \mu_Y$?
Explain.


9.3.8. Crosstown busing to compensate for de facto segre- 
gation was begun on a fairly large scale in Nashville during 
the 1960s. Progress was made, but critics argued that too 
many racial imbalances were left unaddressed. Among 
the data cited in the early 1970s are the following figures, 
showing the percentages of African-American students 
enrolled in a random sample of eighteen public schools 
(165). Nine of the schools were located in predominantly 
African-American neighborhoods; the other nine, in pre- 
dominantly white neighborhoods. Which version of the 
two-sample t test, Theorem 9.2.2 or the Behrens-Fisher
approximation given in Theorem 9.2.3, would be more
appropriate for deciding whether the difference between
35.9% and 19.7% is statistically significant? Justify
your answer.

Schools in African-American   Schools in White
Neighborhoods                 Neighborhoods
36%                           21%
28                            14
41                            11
32                            30
46                            29
39                            6
24                            18
32                            25
45                            23
Average: 35.9%                Average: 19.7%


9.3.9. Show that the generalized likelihood ratio for
testing $H_0: \sigma_X^2 = \sigma_Y^2$ versus $H_1: \sigma_X^2 \neq \sigma_Y^2$ as described in
Theorem 9.3.1 is given by

$$\lambda = \frac{L(\omega_e)}{L(\Omega_e)} = \frac{(m+n)^{(n+m)/2}}{n^{n/2}\, m^{m/2}} \cdot \frac{\left[\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{n/2}\left[\sum_{j=1}^{m}(y_j-\bar{y})^2\right]^{m/2}}{\left[\sum_{i=1}^{n}(x_i-\bar{x})^2+\sum_{j=1}^{m}(y_j-\bar{y})^2\right]^{(m+n)/2}}$$


9.3.10. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent
random samples from normal distributions with
means $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$,
respectively, where $\mu_X$ and $\mu_Y$ are known. Derive the
GLRT for $H_0: \sigma_X^2 = \sigma_Y^2$ versus $H_1: \sigma_X^2 > \sigma_Y^2$.


9.4 Binomial Data: Testing $H_0: p_X = p_Y$


Up to this point, the data considered in this chapter have been independent random 
samples of sizes n and m drawn from two continuous distributions —in fact, from two 
normal distributions. Other scenarios, of course, are quite possible. The X’s and Y’s 
might represent continuous random variables but have density functions other than 
the normal. Or they might be discrete. In this section we consider the most common 
example of this latter type: situations where the two sets of data are binomial. 


Applying the Generalized Likelihood Ratio Criterion 


Suppose that n Bernoulli trials related to treatment X have resulted in x successes,
and that m (independent) Bernoulli trials related to treatment Y have yielded y
successes. We wish to test whether $p_X$ and $p_Y$, the true probabilities of success for
treatment X and treatment Y, are equal:

$$H_0: p_X = p_Y \; (= p)$$
versus
$$H_1: p_X \neq p_Y$$

Let $\alpha$ be the level of significance.
Following the notation used for GLRTs, the two parameter spaces here are

$$\omega = \{(p_X, p_Y): 0 \le p_X = p_Y \le 1\}$$
and
$$\Omega = \{(p_X, p_Y): 0 \le p_X \le 1, \; 0 \le p_Y \le 1\}$$

Furthermore, the likelihood function can be written

$$L = p_X^{x}(1 - p_X)^{n-x} \cdot p_Y^{y}(1 - p_Y)^{m-y}$$

Setting the derivative of ln L with respect to $p \; (= p_X = p_Y)$ equal to 0 and solving for
p gives a not-too-surprising result, namely,

$$p_e = \frac{x+y}{n+m}$$

That is, the maximum likelihood estimate for p under $H_0$ is the pooled success
proportion. Similarly, solving $\partial \ln L/\partial p_X = 0$ and $\partial \ln L/\partial p_Y = 0$ gives the two original
sample proportions as the unrestricted maximum likelihood estimates for $p_X$
and $p_Y$:

$$p_{X_e} = \frac{x}{n}, \qquad p_{Y_e} = \frac{y}{m}$$

Putting $p_e$, $p_{X_e}$, and $p_{Y_e}$ back into L gives the generalized likelihood ratio:

$$\lambda = \frac{L(\omega_e)}{L(\Omega_e)} = \frac{[(x+y)/(n+m)]^{x+y}\,[1-(x+y)/(n+m)]^{n+m-x-y}}{(x/n)^{x}\,[1-(x/n)]^{n-x}\,(y/m)^{y}\,[1-(y/m)]^{m-y}} \tag{9.4.1}$$

Equation 9.4.1 is such a difficult function to work with that it is necessary to
find an approximation to the usual generalized likelihood ratio test. There are several
available. It can be shown, for example, that $-2 \ln \lambda$ for this problem has an
asymptotic $\chi^2$ distribution with 1 degree of freedom (200). Thus, an approximate
two-sided, $\alpha = 0.05$ test is to reject $H_0$ if $-2 \ln \lambda \ge 3.84$.

Another approach, and the one most often used, is to appeal to the central limit
theorem and make the observation that

$$\frac{\dfrac{X}{n} - \dfrac{Y}{m} - E\left(\dfrac{X}{n} - \dfrac{Y}{m}\right)}{\sqrt{\operatorname{Var}\left(\dfrac{X}{n} - \dfrac{Y}{m}\right)}}$$

has an approximate standard normal distribution. Under $H_0$, of course,

$$E\left(\frac{X}{n} - \frac{Y}{m}\right) = 0$$

and

$$\operatorname{Var}\left(\frac{X}{n} - \frac{Y}{m}\right) = \frac{p(1-p)}{n} + \frac{p(1-p)}{m} = \frac{(n+m)\,p(1-p)}{nm}$$

If p is now replaced by $\frac{x+y}{n+m}$, its maximum likelihood estimate under $\omega$, we get the
statement of Theorem 9.4.1.


Theorem 9.4.1 Let x and y denote the numbers of successes observed in two independent sets of n
and m Bernoulli trials, respectively, where $p_X$ and $p_Y$ are the true success probabilities
associated with each set of trials. Let $p_e = \frac{x+y}{n+m}$ and define

$$z = \frac{\dfrac{x}{n} - \dfrac{y}{m}}{\sqrt{\dfrac{p_e(1-p_e)}{n} + \dfrac{p_e(1-p_e)}{m}}}$$

a. To test $H_0: p_X = p_Y$ versus $H_1: p_X > p_Y$ at the $\alpha$ level of significance, reject $H_0$ if
$z \ge z_\alpha$.

b. To test $H_0: p_X = p_Y$ versus $H_1: p_X < p_Y$ at the $\alpha$ level of significance, reject $H_0$ if
$z \le -z_\alpha$.

c. To test $H_0: p_X = p_Y$ versus $H_1: p_X \neq p_Y$ at the $\alpha$ level of significance, reject $H_0$ if z
is either (1) $\le -z_{\alpha/2}$ or (2) $\ge z_{\alpha/2}$.
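As a rough illustration of how the statistic in Theorem 9.4.1 might be computed, the Python sketch below pools the two samples, forms z, and converts it to a two-sided P-value using the standard normal tail. The function names `two_proportion_z` and `upper_tail` and the counts in the example are hypothetical, introduced only to exercise the formula.

```python
from math import erfc, sqrt


def two_proportion_z(x, n, y, m):
    """z statistic for x successes in n trials versus y successes in m trials."""
    p_e = (x + y) / (n + m)  # pooled estimate of p under H0
    se = sqrt(p_e * (1 - p_e) / n + p_e * (1 - p_e) / m)
    return (x / n - y / m) / se


def upper_tail(z):
    """P(Z >= z) for a standard normal random variable Z."""
    return 0.5 * erfc(z / sqrt(2))


# Hypothetical counts: 30 successes in 50 trials versus 20 in 50.
z = two_proportion_z(30, 50, 20, 50)
p_two_sided = 2 * upper_tail(abs(z))
print(f"z = {z:.2f}, two-sided P-value = {p_two_sided:.3f}")
```

For a one-sided alternative such as $H_1: p_X > p_Y$, `upper_tail(z)` alone would give the corresponding P-value.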


Comment The utility of Theorem 9.4.1 actually extends beyond the scope we have 
just described. Any continuous variable can always be dichotomized and “trans- 
formed” into a Bernoulli variable. For example, blood pressure can be recorded in 
terms of “mm Hg,” a continuous variable, or simply as “normal” or “abnormal,” a 
Bernoulli variable. The next two case studies illustrate these two sources of binomial 
data. In the first, the measurements begin and end as Bernoulli variables; in the sec- 
ond, the initial measurement of “number of nightmares per month” is dichotomized 
into “often” and “seldom.” 


Case Study 9.4.1 


Until almost the end of the nineteenth century, the mortality associated with 
surgical operations—even minor ones—was extremely high. The major prob- 
lem was infection. The germ theory as a model for disease transmission was still 
unknown, so there was no concept of sterilization. As a result, many patients 
died from postoperative complications. 

The major breakthrough that was so desperately needed finally came when 
Joseph Lister, a British physician, began reading about some of the work done 
by Louis Pasteur. In a series of classic experiments, Pasteur had succeeded 
in demonstrating the role that yeasts and bacteria play in fermentation. Lis- 
ter conjectured that human infections might have a similar organic origin. To 
test his theory he began using carbolic acid as an operating-room disinfectant. 
He performed forty amputations with the aid of carbolic acid, and thirty-four 
patients survived. He also did thirty-five amputations without carbolic acid, and 
nineteen patients survived. While it seems clear that carbolic acid did improve 
survival rates, a test of statistical significance helps to rule out a difference due 
to chance (202). 

Let $p_X$ be the true probability of survival with carbolic acid, and let $p_Y$
denote the true survival probability without the antiseptic. The hypotheses to
be tested are

$$H_0: p_X = p_Y \; (= p)$$
versus
$$H_1: p_X > p_Y$$

Take $\alpha = 0.01$.
If $H_0$ is true, the pooled estimate of p would be the overall survival rate.
That is,

$$p_e = \frac{34+19}{40+35} = \frac{53}{75} = 0.707$$

The sample proportions for survival with and without carbolic acid are 34/40 =
0.850 and 19/35 = 0.543, respectively. According to Theorem 9.4.1, then, the test
statistic is

$$z = \frac{0.850 - 0.543}{\sqrt{\dfrac{(0.707)(0.293)}{40} + \dfrac{(0.707)(0.293)}{35}}} = 2.92$$

Since z exceeds the $\alpha = 0.01$ critical value ($z_{.01} = 2.33$), we should reject the null
hypothesis and conclude that the use of carbolic acid saves lives.
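The Lister calculation can be checked, and compared with the likelihood ratio approach, using the sketch below. It recomputes z from the raw counts and also evaluates $-2 \ln \lambda$ from Equation 9.4.1 by working with log likelihoods. The helper `binom_loglik` is a name introduced here. At full precision z is about 2.91 (the text's 2.92 reflects rounding the proportions first). Note that the $\chi^2$ cutoff of 3.84 applies to a two-sided test at $\alpha = 0.05$, so it is not directly comparable with the one-sided $\alpha = 0.01$ test in the case study.

```python
from math import log, sqrt

x, n = 34, 40  # survivors / amputations with carbolic acid
y, m = 19, 35  # survivors / amputations without carbolic acid

p_e = (x + y) / (n + m)
z = (x / n - y / m) / sqrt(p_e * (1 - p_e) * (1 / n + 1 / m))


def binom_loglik(successes, trials, p):
    """Log of p^s (1 - p)^(t - s); assumes 0 < p < 1."""
    return successes * log(p) + (trials - successes) * log(1 - p)


log_l_omega = binom_loglik(x + y, n + m, p_e)  # maximized under H0
log_l_Omega = binom_loglik(x, n, x / n) + binom_loglik(y, m, y / m)
neg2_log_lambda = -2 * (log_l_omega - log_l_Omega)

print(f"p_e = {p_e:.3f}, z = {z:.2f}, -2 ln(lambda) = {neg2_log_lambda:.2f}")
```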
About the Data In spite of this study and a growing body of similar evidence, the
theory of antiseptic surgery was not immediately accepted in Lister’s native England.
Continental European surgeons, though, understood the value of Lister’s work
and in 1875 presented him with a humanitarian award.


Case Study 9.4.2 


Over the years, numerous studies have sought to characterize the nightmare 
sufferer. Out of these has emerged the stereotype of someone with high anxi- 
ety, low ego strength, feelings of inadequacy, and poorer-than-average physical 
health. What is not so well known, though, is whether men fall into this pattern 
with the same frequency as women. To this end, a clinical survey (77) looked at 
nightmare frequencies for a sample of 160 men and 192 women. Each subject 
was asked whether he (or she) experienced nightmares “often” (at least once 
a month) or “seldom” (less than once a month). The percentages of men and 
women saying “often” were 34.4% and 31.3%, respectively (see Table 9.4.1). 
Is the difference between those two percentages statistically significant? 


Table 9.4.1 Frequency of Nightmares

                     Men     Women   Total
Nightmares often     55      60      115
Nightmares seldom    105     132     237
Totals               160     192     352
% often:             34.4    31.3


Let $p_M$ and $p_W$ denote the true proportions of men having nightmares
often and women having nightmares often, respectively. The hypotheses to be
tested are

$$H_0: p_M = p_W$$
versus
$$H_1: p_M \neq p_W$$

Let $\alpha = 0.05$. Then $\pm z_{.025} = \pm 1.96$ become the two critical values. Moreover,
$p_e = \frac{55+60}{160+192} = 0.327$, so

$$z = \frac{0.344 - 0.313}{\sqrt{\dfrac{(0.327)(0.673)}{160} + \dfrac{(0.327)(0.673)}{192}}} = 0.62$$


The conclusion, then, is clear: We fail to reject the null hypothesis—these data 
provide no convincing evidence that the frequency of nightmares is different for 
men than for women. 
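To quantify how far the nightmare data are from significance, one can attach a P-value to the observed z. The sketch below recomputes z from the counts in Table 9.4.1 and evaluates the two-sided tail area of the standard normal with `math.erfc`; the variable names are chosen here for readability and are not from the text.

```python
from math import erfc, sqrt

often_men, n_men = 55, 160
often_women, n_women = 60, 192

p_e = (often_men + often_women) / (n_men + n_women)
se = sqrt(p_e * (1 - p_e) * (1 / n_men + 1 / n_women))
z = (often_men / n_men - often_women / n_women) / se

# Two-sided P-value: P(|Z| >= |z|) for a standard normal Z.
p_value = erfc(abs(z) / sqrt(2))

print(f"z = {z:.2f}, two-sided P-value = {p_value:.2f}")
```

A P-value above one half is consistent with the case study's decision not to reject $H_0$ at $\alpha = 0.05$.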


About the Data The results of every statistical study are intended to be
generalized, from the subjects measured to a broader population that the sample
might reasonably be expected to represent. Obviously, then, knowing something
about the subjects is essential if a set of data is to be interpreted (and extrapolated)
properly. Table 9.4.1 is a cautionary case in point. The 352 individuals interviewed
were not the typical sort of subjects solicited for a university research project. They
were all institutionalized mental patients.


Questions 


9.4.1. The phenomenon of handedness has been exten- 
sively studied in human populations. The percentages of 
adults who are right-handed, left-handed, and ambidex- 
trous are well documented. What is not so well known is 
that a similar phenomenon is present in lower animals. 
Dogs, for example, can be either right-pawed or left- 
pawed. Suppose that in a random sample of 200 beagles, it 
is found that 55 are left-pawed and that in a random sam- 
ple of 200 collies, 40 are left-pawed. Can we conclude that 
the difference in the two sample proportions of left-pawed 
dogs is statistically significant for $\alpha = 0.05$?


9.4.2. In a study designed to see whether a controlled 
diet could retard the process of arteriosclerosis, a total of 
846 randomly chosen persons were followed over an eight- 
year period. Half were instructed to eat only certain foods; 
the other half could eat whatever they wanted. At the end 
of eight years, 66 persons in the diet group were found 
to have died of either myocardial infarction or cerebral 
infarction, as compared to 93 deaths of a similar nature in 
the control group (203). Do the appropriate analysis. Let
$\alpha = 0.05$.


9.4.3. Water witching, the practice of using the move- 
ments of a forked twig to locate underground water (or 
minerals), dates back over 400 years. Its first detailed 
description appears in Agricola’s De re Metallica, pub- 
lished in 1556. That water witching works remains a belief 
widely held among rural people in Europe and through- 
out the Americas. [In 1960 the number of “active” water 
witches in the United States was estimated to be more 
than 20,000 (193).] Reliable evidence supporting or refut- 
ing water witching is hard to find. Personal accounts of 
isolated successes or failures tend to be strongly biased 
by the attitude of the observer. Of all the wells dug in 
Fence Lake, New Mexico, 29 “witched” wells and 32 “non- 
witched” wells were sunk. Of the “witched” wells, 24 
were successful. For the “nonwitched” wells, there were 
27 successes. What would you conclude? 


9.4.4. If flying saucers are a genuine phenomenon, it 
would follow that the nature of sightings (that is, their 
physical characteristics) would be similar in different parts 
of the world. A prominent UFO investigator compiled a 
listing of 91 sightings reported in Spain and 1117 reported 
elsewhere. Among the information recorded was whether 
the saucer was on the ground or hovering. His data are 
summarized in the following table (87). Let $p_S$ and $p_{NS}$
denote the true probabilities of “Saucer on ground” in
Spain and not in Spain, respectively. Test $H_0: p_S = p_{NS}$
against a two-sided $H_1$. Let $\alpha = 0.01$.


                   In Spain   Not in Spain
Saucer on ground   53         705
Saucer hovering    38         412

9.4.5. In some criminal cases, the judge and the defen- 
dant’s lawyer will enter into a plea bargain, where the 
accused pleads guilty to a lesser charge. The proportion of 
time this happens is called the mitigation rate. A Florida 
Corrections Department study showed that Escambia 
County had the state’s fourth highest rate, 61.7% (1033 
out of 1675 cases). Concerned that the guilty were not get- 
ting appropriate sentences, the state attorney put in new 
policies to limit the number of plea bargains. A follow- 
up study (133) showed that the mitigation rate dropped 
to 52.1% (344 out of 660 cases). Is it fair to conclude that 
the drop was due to the new policies, or can the decline be 
written off to chance? Test at the $\alpha = 0.01$ level.


9.4.6. Suppose $H_0: p_X = p_Y$ is being tested against
$H_1: p_X \neq p_Y$ on the basis of two independent sets of one
hundred Bernoulli trials. If x, the number of successes 
in the first set, is sixty and y, the number of successes 
in the second set, is forty-eight, what P-value would be 
associated with the data? 


9.4.7. A total of 8605 students are enrolled full-time at 
State University this semester, 4134 of whom are women. 
Of the 6001 students who live on campus, 2915 are women. 
Can it be argued that the difference in the proportion of 
men and women living on campus is statistically significant?
Carry out an appropriate analysis. Let $\alpha = 0.05$.


9.4.8. The kittiwake is a seagull whose mating behavior 
is basically monogamous. Normally, the birds separate for 
several months after the completion of one breeding sea- 
son and reunite at the beginning of the next. Whether or 
not the birds actually do reunite, though, may be affected 
by the success of their “relationship” the season before. 
A total of 769 kittiwake pair-bonds were studied (30) over 
the course of two breeding seasons; of those 769, some 609 
successfully bred during the first season; the remaining 160 
were unsuccessful. The following season, 175 of the previ- 
ously successful pair-bonds “divorced,” as did 100 of the 
160 whose prior relationship left something to be desired.
Can we conclude that the difference in the two divorce
rates (29% and 63%) is statistically significant?


                       Breeding in Previous Year
                       Successful   Unsuccessful
Number divorced        175          100
Number not divorced    434          60
Total                  609          160
Percent divorced       29           63


9.4.9. A utility infielder for a National League club batted
.260 last season in three hundred trips to the plate. This
year he hit .250 in two hundred at-bats. The owners are
trying to cut his pay for next year on the grounds that his
output has deteriorated. The player argues, though, that
his performances the last two seasons have not been significantly
different, so his salary should not be reduced.


9.4.10. Compute $-2 \ln \lambda$ (see Equation 9.4.1) for the
nightmare data of Case Study 9.4.2, and use it to test the
hypothesis that $p_M = p_W$. Let $\alpha = 0.01$.


9.5 Confidence Intervals for the Two-Sample Problem 


Two-sample data lend themselves nicely to the hypothesis testing format because
a meaningful $H_0$ can always be defined (which is not the case for every set of one-sample
data). The same inferences, though, can just as easily be phrased in terms of
confidence intervals. Simple inversions similar to the derivation of Equation 7.4.1
will yield confidence intervals for $\mu_X - \mu_Y$, $\sigma_X^2/\sigma_Y^2$, and $p_X - p_Y$.


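Theorem 9.5.1 Let $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$ be independent random samples drawn from normal
distributions with means $\mu_X$ and $\mu_Y$, respectively, and with the same standard
deviation, $\sigma$. Let $s_p$ denote the data’s pooled standard deviation. A $100(1 - \alpha)\%$
confidence interval for $\mu_X - \mu_Y$ is given by

$$\left(\bar{x} - \bar{y} - t_{\alpha/2,\,n+m-2} \cdot s_p\sqrt{\frac{1}{n}+\frac{1}{m}},\;\; \bar{x} - \bar{y} + t_{\alpha/2,\,n+m-2} \cdot s_p\sqrt{\frac{1}{n}+\frac{1}{m}}\right)$$

Proof We know from Theorem 9.2.1 that

$$\frac{\bar{X}-\bar{Y} - (\mu_X - \mu_Y)}{S_p\sqrt{\frac{1}{n}+\frac{1}{m}}}$$

has a Student t distribution with n + m − 2 df. Therefore,

$$P\left(-t_{\alpha/2,\,n+m-2} \le \frac{\bar{X}-\bar{Y} - (\mu_X - \mu_Y)}{S_p\sqrt{\frac{1}{n}+\frac{1}{m}}} \le t_{\alpha/2,\,n+m-2}\right) = 1 - \alpha \tag{9.5.1}$$

Rewriting Equation 9.5.1 by isolating $\mu_X - \mu_Y$ in the center of the inequalities gives
the endpoints stated in the theorem.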


Case Study 9.5.1 


Case Study 8.2.2 made the claim that X-rays penetrate the tooth enamel of men 
and women differently, a fact that allows dental structure to help identify the 
sex of badly decomposed bodies. In this case study, the statistical analysis for
that assertion is provided. Moreover, the resulting confidence interval gives an
estimate of the difference in the mean enamel spectropenetration gradients for 
the two sexes. 

Listed in Table 9.5.1 (and Table 8.2.2) are the gradients for eight female 
teeth and eight male teeth (57). These numbers are measures of the rate of 
change in the amount of X-ray penetration through a 500-micron section of 
tooth enamel at a wavelength of 600 nm as opposed to 400 nm. 


Table 9.5.1 Enamel Spectropenetration Gradients 


Male, x; Female, y; 


Let $\mu_X$ and $\mu_Y$ be the population means of the spectropenetration gradients
associated with male teeth and with female teeth, respectively. Note that

$$\sum_{i=1}^{8} x_i = 43.4 \quad \text{and} \quad \sum_{i=1}^{8} x_i^2 = 239.32$$

from which

$$\bar{x} = \frac{43.4}{8} = 5.4$$

and

$$s_X^2 = \frac{8(239.32) - (43.4)^2}{8(7)} = 0.55$$

Similarly,

$$\sum_{i=1}^{8} y_i = 36.1 \quad \text{and} \quad \sum_{i=1}^{8} y_i^2 = 166.95$$

so that

$$\bar{y} = \frac{36.1}{8} = 4.5$$

and

$$s_Y^2 = \frac{8(166.95) - (36.1)^2}{8(7)} = 0.58$$

Therefore, the pooled standard deviation is equal to 0.75:

$$s_p = \sqrt{\frac{7(0.55) + 7(0.58)}{8 + 8 - 2}} = 0.75$$

We know that the ratio

$$\frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{S_p\sqrt{\frac{1}{8}+\frac{1}{8}}}$$

will be approximated by a Student t curve with 14 degrees of freedom. Since
$t_{.025,14} = 2.1448$, the 95% confidence interval for $\mu_X - \mu_Y$ is given by

$$\left(\bar{x} - \bar{y} - 2.1448\, s_p\sqrt{\frac{1}{8}+\frac{1}{8}},\;\; \bar{x} - \bar{y} + 2.1448\, s_p\sqrt{\frac{1}{8}+\frac{1}{8}}\right)$$

$$= \left(5.4 - 4.5 - 2.1448(0.75)\sqrt{0.25},\;\; 5.4 - 4.5 + 2.1448(0.75)\sqrt{0.25}\right)$$

$$= (0.1, 1.7)$$
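The steps of Case Study 9.5.1 can be retraced from the four sums alone. The sketch below is one way to do so in Python; the critical value 2.1448 is taken from the text rather than computed, and `sample_variance` is a helper name introduced here.

```python
from math import sqrt

n = m = 8
sum_x, sum_x2 = 43.4, 239.32  # male teeth
sum_y, sum_y2 = 36.1, 166.95  # female teeth
t_crit = 2.1448               # t_{.025,14}, quoted in the text


def sample_variance(total, total_sq, size):
    """Computing formula: [k * sum(v^2) - (sum v)^2] / [k * (k - 1)]."""
    return (size * total_sq - total ** 2) / (size * (size - 1))


x_bar, y_bar = sum_x / n, sum_y / m
s2_x = sample_variance(sum_x, sum_x2, n)
s2_y = sample_variance(sum_y, sum_y2, m)

# Pooled standard deviation: weight each variance by its degrees of freedom.
s_p = sqrt(((n - 1) * s2_x + (m - 1) * s2_y) / (n + m - 2))

half_width = t_crit * s_p * sqrt(1 / n + 1 / m)
lower = x_bar - y_bar - half_width
upper = x_bar - y_bar + half_width

print(f"s_p = {s_p:.2f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```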


Comment Here the 95% confidence interval does not include the value 0. This
means that had we tested

$$H_0: \mu_X = \mu_Y$$
versus
$$H_1: \mu_X \neq \mu_Y$$

at the $\alpha = 0.05$ level of significance, $H_0$ would have been rejected.


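Comment For the scenario of Theorem 9.5.1, if the variances are not equal, then
an approximate $100(1 - \alpha)\%$ confidence interval is given by

$$\left(\bar{x} - \bar{y} - t_{\alpha/2,\nu}\sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}},\;\; \bar{x} - \bar{y} + t_{\alpha/2,\nu}\sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}\right)$$

where

$$\nu = \frac{\left(\hat{\theta} + \dfrac{n}{m}\right)^2}{\dfrac{1}{n-1}\hat{\theta}^2 + \dfrac{1}{m-1}\left(\dfrac{n}{m}\right)^2}, \quad \text{for } \hat{\theta} = \frac{s_X^2}{s_Y^2}$$

If the degrees of freedom exceed 100, then the form above is used, with $z_{\alpha/2}$
replacing $t_{\alpha/2,\nu}$.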
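The degrees of freedom in this approximation are usually written in terms of $\hat{\theta} = s_X^2/s_Y^2$; the sketch below assumes that form, $\nu = (\hat{\theta} + n/m)^2 / [\hat{\theta}^2/(n-1) + (n/m)^2/(m-1)]$, and checks numerically that it agrees with the more familiar Welch-Satterthwaite expression. Both function names and the summary values are hypothetical, chosen only for illustration.

```python
def welch_df(s2_x, n, s2_y, m):
    """Degrees of freedom written in terms of theta = s_X^2 / s_Y^2."""
    theta = s2_x / s2_y
    return (theta + n / m) ** 2 / (theta ** 2 / (n - 1) + (n / m) ** 2 / (m - 1))


def satterthwaite_df(s2_x, n, s2_y, m):
    """The same quantity in the Welch-Satterthwaite form."""
    a, b = s2_x / n, s2_y / m
    return (a + b) ** 2 / (a ** 2 / (n - 1) + b ** 2 / (m - 1))


# Hypothetical summary values: s_X^2 = 4.0 with n = 12, s_Y^2 = 9.0 with m = 20.
v1 = welch_df(4.0, 12, 9.0, 20)
v2 = satterthwaite_df(4.0, 12, 9.0, 20)
print(f"theta form: {v1:.2f}, Satterthwaite form: {v2:.2f}")
```

The two expressions differ only by a common factor of $(n/s_Y^2)^2$ in numerator and denominator, which is why they return the same value.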
Theorem 9.5.2 Let $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$ be independent random samples drawn from normal
distributions with standard deviations $\sigma_X$ and $\sigma_Y$, respectively. A $100(1 - \alpha)\%$
confidence interval for the variance ratio, $\sigma_X^2/\sigma_Y^2$, is given by

$$\left(\frac{s_X^2}{s_Y^2}\, F_{\alpha/2,\,m-1,\,n-1},\;\; \frac{s_X^2}{s_Y^2}\, F_{1-\alpha/2,\,m-1,\,n-1}\right)$$

Proof Start with the fact that $\dfrac{S_Y^2/\sigma_Y^2}{S_X^2/\sigma_X^2}$ has an F distribution with m − 1 and n − 1 df,
and follow the strategy used in the proof of Theorem 9.5.1; that is, isolate $\sigma_X^2/\sigma_Y^2$ in
the center of the analogous inequalities.


Case Study 9.5.2


The easiest way to measure the movement, or flow, of a glacier is with a cam- 
era. First a set of reference points is marked off at various sites near the 
glacier’s edge. Then these points, along with the glacier, are photographed 
from an airplane. The problem is this: How long should the time interval 
be between photographs? If too short a period has elapsed, the glacier will 
not have moved very far and the errors associated with the photographic 
technique will be relatively large. If too long a period has elapsed, parts 
of the glacier might be deformed by the surrounding terrain, an eventual- 
ity that could introduce substantial variability into the point-to-point velocity 
estimates. 

Two sets of flow rates for the Antarctic’s Hoseason Glacier have been cal- 
culated (115), one based on photographs taken three years apart, the other, five 
years apart (see Table 9.5.2). On the basis of other considerations, it can be 
assumed that the “true” flow rate was constant for the eight years in question. 


Table 9.5.2 Flow Rates Estimated for the Hoseason
Glacier (Meters per Day)

Three-Year Span, $x_i$   Five-Year Span, $y_i$
0.73                     0.72
0.76                     0.74
0.75                     0.74
0.77                     0.72
0.73                     0.72
0.75
0.74


The objective here is to assess the relative variabilities associated with the
three- and five-year time periods. One way to do this, assuming the data to be
normal, is to construct, say, a 95% confidence interval for the variance ratio.
If that interval does not contain the value 1, we infer that the two time periods
lead to flow rate estimates of significantly different precision.

From Table 9.5.2,

$$\sum_{i=1}^{7} x_i = 5.23 \quad \text{and} \quad \sum_{i=1}^{7} x_i^2 = 3.9089$$

so that

$$s_X^2 = \frac{7(3.9089) - (5.23)^2}{7(6)} = 0.000224$$

Similarly,

$$\sum_{i=1}^{5} y_i = 3.64 \quad \text{and} \quad \sum_{i=1}^{5} y_i^2 = 2.6504$$

making

$$s_Y^2 = \frac{5(2.6504) - (3.64)^2}{5(4)} = 0.000120$$

The two critical values come from Table A.4 in the Appendix:

$$F_{.025,4,6} = 0.109 \quad \text{and} \quad F_{.975,4,6} = 6.23$$

Substituting, then, into the statement of Theorem 9.5.2 gives (0.203, 11.629) as
a 95% confidence interval for $\sigma_X^2/\sigma_Y^2$:

$$\left(\frac{0.000224}{0.000120}\,(0.109),\;\; \frac{0.000224}{0.000120}\,(6.23)\right) = (0.203, 11.629)$$

Thus, although the three-year data have a larger sample variance than the five-year
data, no conclusions can be drawn about the true variances being different,
because the ratio $\sigma_X^2/\sigma_Y^2 = 1$ is contained in the confidence interval.
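The glacier interval can be reproduced directly from the raw flow rates. The sketch below assumes the table holds seven three-year values and five five-year values (which matches the sums 5.23 and 3.64 quoted in the case study) and takes the two F critical values from the text. Because it does not round the variances before dividing, the upper limit comes out near 11.62 rather than 11.629.

```python
from statistics import variance

three_year = [0.73, 0.76, 0.75, 0.77, 0.73, 0.75, 0.74]  # x_i, n = 7
five_year = [0.72, 0.74, 0.74, 0.72, 0.72]               # y_i, m = 5

s2_x = variance(three_year)  # sample variance, divisor n - 1
s2_y = variance(five_year)
ratio = s2_x / s2_y

# F_{.025,4,6} and F_{.975,4,6}, copied from the text (Table A.4).
f_lo, f_hi = 0.109, 6.23
lower, upper = ratio * f_lo, ratio * f_hi

print(f"s_X^2 = {s2_x:.6f}, s_Y^2 = {s2_y:.6f}")
print(f"95% CI for the variance ratio: ({lower:.3f}, {upper:.2f})")
print("interval contains 1:", lower < 1 < upper)
```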


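Theorem 9.5.3 Let x and y denote the numbers of successes observed in two independent sets of n
and m Bernoulli trials, respectively. If $p_X$ and $p_Y$ denote the true success probabilities,
an approximate $100(1 - \alpha)\%$ confidence interval for $p_X - p_Y$ is given by

$$\left(\frac{x}{n} - \frac{y}{m} - z_{\alpha/2}\sqrt{\frac{\left(\frac{x}{n}\right)\left(1-\frac{x}{n}\right)}{n} + \frac{\left(\frac{y}{m}\right)\left(1-\frac{y}{m}\right)}{m}},\;\; \frac{x}{n} - \frac{y}{m} + z_{\alpha/2}\sqrt{\frac{\left(\frac{x}{n}\right)\left(1-\frac{x}{n}\right)}{n} + \frac{\left(\frac{y}{m}\right)\left(1-\frac{y}{m}\right)}{m}}\right)$$

Proof See Question 9.5.11.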


Case Study 9.5.3 


If a hospital patient’s heart stops, an emergency message, code blue, is called. 
A team rushes to the bedside and attempts to revive the patient. A study (131) 
suggests that patients are better off not suffering cardiac arrest after 11 p.m., the 
so-called graveyard shift. The study lasted seven years and used non-emergency 
room data from over five hundred hospitals. During the day and early evening 
hours, 58,593 cardiac arrests occurred and 11,604 patients survived to leave the 
hospital. For the 11 p.m. shift, of the 28,155 heart stoppages, 4139 patients lived 
to be discharged. 

Let $p_X$ (estimated by 11,604/58,593 = 0.198) be the true probability of survival
during the earlier hours. Let $p_Y$ denote the true survival probability for the
graveyard shift (estimated by 4139/28,155 = 0.147). To construct a 95% confidence
interval for $p_X - p_Y$, take $z_{\alpha/2} = 1.96$. Then Theorem 9.5.3 gives the lower
limit of the confidence interval as

$$0.198 - 0.147 - 1.96\sqrt{\frac{(0.198)(0.802)}{58{,}593} + \frac{(0.147)(0.853)}{28{,}155}} = 0.0458$$

and the upper limit as

$$0.198 - 0.147 + 1.96\sqrt{\frac{(0.198)(0.802)}{58{,}593} + \frac{(0.147)(0.853)}{28{,}155}} = 0.0562$$

so the 95% confidence interval is (0.0458, 0.0562).
Since $p_X - p_Y = 0$ is not included in the interval (which lies entirely to the
right of 0), we can conclude that survival rates are worse during the graveyard
shift.
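A minimal Python sketch of the same interval, working from the raw counts rather than the rounded proportions, is given below. The variable names are introduced here; $z_{\alpha/2} = 1.96$ is taken from the text. At full precision the limits agree with the case study to within about 0.0001.

```python
from math import sqrt

x, n = 11604, 58593  # survivors / arrests, day and early evening
y, m = 4139, 28155   # survivors / arrests, 11 p.m. shift
z_crit = 1.96        # z_{.025}

p_x, p_y = x / n, y / m
se = sqrt(p_x * (1 - p_x) / n + p_y * (1 - p_y) / m)

lower = p_x - p_y - z_crit * se
upper = p_x - p_y + z_crit * se

print(f"p_X = {p_x:.3f}, p_Y = {p_y:.3f}")
print(f"95% CI for p_X - p_Y: ({lower:.4f}, {upper:.4f})")
```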


Questions 


9.5.1 In 1965 a silver shortage in the United States 
prompted Congress to authorize the minting of silverless 
dimes and quarters. They also recommended that the sil- 
ver content of half-dollars be reduced from 90% to 40%. 
Historically, fluctuations in the amount of rare metals 
found in coins are not uncommon (76). The following data 
may be a case in point. Listed are the silver percentages 
found in samples of a Byzantine coin minted on two sep- 
arate occasions during the reign of Manuel I (1143-1180). 
Construct a 90% confidence interval for $\mu_X - \mu_Y$, the true
average difference in the coin’s silver content (= “early” −
“late”). What does the interval imply about the outcome
of testing $H_0: \mu_X = \mu_Y$? For these data $s_X = 0.54$ and $s_Y =
0.36$.


Early Coinage, x; Late Coinage, y; 


(% Ag) (% Ag) 
5.9 5.3 
6.8 5.6 
6.4 Ryo) 
7.0 5.1 
6.6 6.2 
7.7 5.8 
7.2 5.8 
6.9 
6.2 

Average: 6.7 Average: 5.6 


9.5.2 Male fiddler crabs solicit attention from the oppo- 
site sex by standing in front of their burrows and waving 
their claws at the females who walk by. If a female likes 
what she sees, she pays the male a brief visit in his bur- 
row. If everything goes well and the crustacean chemistry 
clicks, she will stay a little longer and mate. In what may 
be a ploy to lessen the risk of spending the night alone, 
some of the males build elaborate mud domes over their 
burrows. Do the following data (215) suggest that a male’s 
time spent waving to females is influenced by whether his 


burrow has a dome? Answer the question by constructing
and interpreting a 95% confidence interval for $\mu_X - \mu_Y$.
Use the value $s_p = 11.2$.


% of Time Spent Waving to Females 


Males with Domes, x; Males without Domes, y; 


100.0 76.4 
58.6 84.2 
93.5 96.5 
83.6 88.8 
84.1 85.3 

79.1 
83.6 


9.5.3 Construct two 99% confidence intervals for $\mu_X - \mu_Y$
using the data of Case Study 9.2.3, first assuming the
variances are equal, and then assuming they are not.


9.5.4 Carry out the details to complete the proof of 
Theorem 9.5.1. 


9.5.5 Suppose that $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ are
independent random samples from normal distributions
with means $\mu_X$ and $\mu_Y$ and known standard deviations
$\sigma_X$ and $\sigma_Y$, respectively. Derive a $100(1 - \alpha)\%$ confidence
interval for $\mu_X - \mu_Y$.


9.5.6 Construct a 95% confidence interval for $\sigma_X^2/\sigma_Y^2$
based on the data in Case Study 9.2.1. The hypothesis test 
referred to tacitly assumed that the variances were equal. 
Does that agree with your confidence interval? Explain. 


9.5.7 One of the parameters used in evaluating myocardial
function is the end diastolic volume (EDV). The following
table shows EDVs recorded for eight persons considered
to have normal cardiac function and for six with
constrictive pericarditis (192). Would it be correct to use
Theorem 9.2.2 to test $H_0$: $\mu_X = \mu_Y$? Answer the question
by constructing a 95% confidence interval for $\sigma_X^2/\sigma_Y^2$.

Normal, $x_i$    Constrictive Pericarditis, $y_i$


62 24 
60 56 
78 42 
62 74 
49 44 
67 28 
80 

48 


9.5.8 Complete the proof of Theorem 9.5.2. 


9.5.9 Flonase is a nasal spray for diminishing nasal allergic 
symptoms. In clinical trials for side effects, 782 sufferers 
from allergic rhinitis were given a daily dose of 200 mcg of 
Flonase. Of this group, 126 reported headaches. A group 
of 758 subjects were given a placebo, and 111 of them 
reported headaches. Find a 95% confidence interval for 
the difference in proportion of headaches for the two 
groups. Does the confidence interval suggest a statistically 
significant difference in the frequency of headaches for 
Flonase users? 


Source: http://www.drugs.com/sfx/flonase-side-effects.html.


9.5.10 Construct an 80% confidence interval for the
difference $p_M - p_W$ in the nightmare frequency data
summarized in Case Study 9.4.2.


9.5.11 If $p_X$ and $p_Y$ denote the true success probabilities
associated with two sets of $n$ and $m$ independent Bernoulli
trials, respectively, the ratio

$$\frac{\dfrac{X}{n} - \dfrac{Y}{m} - (p_X - p_Y)}{\sqrt{\dfrac{(X/n)(1 - X/n)}{n} + \dfrac{(Y/m)(1 - Y/m)}{m}}}$$

has approximately a standard normal distribution. Use
that fact to prove Theorem 9.5.3.


9.5.12 Suicide rates in the United States tend to be much 
higher for men than for women, at all ages. That pat- 
tern may not extend to all professions, though. Death 
certificates obtained for the 3637 members of the Ameri- 
can Chemical Society who died over a twenty-year period 
revealed that 106 of the 3522 male deaths were suicides, as 
compared to 13 of the 115 female deaths (101). Construct 
a 95% confidence interval for the difference in suicide 
rates. What would you conclude? 


9.6 Taking a Second Look at Statistics (Choosing Samples)


Choosing sample sizes is a topic that invariably receives extensive coverage when- 
ever applied statistics and experimental design are discussed. For good reason. 
Whatever the context, the number of observations making up a data set figures 
prominently in the ability of those data to address any and all of the questions raised 
by the experimenter. As sample sizes get larger, we know that estimators become 
more precise and hypothesis tests get better at distinguishing between $H_0$ and $H_1$.
Larger sample sizes, of course, are also more expensive. The trade-off between how 
many observations researchers can afford to take and how many they would like to 
take is a choice that has to be made early on in the design of any experiment. If the 
sample sizes ultimately decided upon are too small, there is a risk that the objec- 
tives of the study will not be fully achieved—parameters may be estimated with 
insufficient precision and hypothesis tests may reach incorrect conclusions. 

That said, choosing sample sizes is often not as critical to the success of an exper- 
iment as choosing sample subjects. In a two-sample design, for example, how should 
we decide which particular subjects to assign to treatment X and which to treatment 
Y? If the subjects comprising a sample are somehow “biased” with respect to the 
measurement being recorded, the integrity of the conclusions is irretrievably com- 
promised. There are no statistical techniques for “correcting” inferences based on 
measurements that were biased in some unknown way. It is also true that biases can 
be very subtle, yet still have a pronounced effect on the final measurements. That 
being the case, it is incumbent on researchers to take every possible precaution at 
the outset to prevent inappropriate assignments of subjects to treatments. 

For example, suppose for your Senior Project you plan to study whether a new
synthetic testosterone can affect the behavior of female rats. Your intention is to set
up a two-sample design where ten rats will be given weekly injections of the new
testosterone compound and another ten rats will serve as a control group, receiving
weekly injections of a placebo. At the end of eight weeks, all twenty rats will be put
in a large community cage, and the behavior of each one will be closely monitored
for signs of aggression.

Last week you placed an order for twenty female Rattus norvegicus from the 
local Rats ’R Us franchise. They arrived today, all housed in one large cage. Your 
plan is to remove ten of the twenty “at random,” and then put those ten in a similarly 
large cage. The ten removed will be receiving the testosterone injections; the ten 
remaining in the original cage will constitute the control group. The question is, 
which ten should be removed? 

The obvious answer—reach in and pull out ten—is very much the wrong answer! 
Why? Because the samples formed in such a way might very well be biased if, for 
example, you (understandably) tended to avoid grabbing the rats that looked like 
they might bite. If that were the case, the ones you drew out would be biased, by 
virtue of being more passive than the ones left behind. Since the measurements ulti- 
mately to be taken deal with aggression, biasing the samples in that particular way 
would be a fatal flaw. Whether the total sample size was twenty or twenty thousand, 
the results would be worthless. 

In general, relying on our intuitive sense of the word “random” to allocate sub- 
jects to different treatments is risky, to say the least. The correct approach would 
be to number the rats from 1 to 20 and then use a random number table or a com- 
puter’s random number generator to identify the ten to be removed. Figure 9.6.1 
shows the Minitab syntax for choosing a random sample of ten numbers from the 
integers 1 through 20. According to this particular run of the SAMPLE routine, the 
ten rats to be removed for the testosterone injections are (in order) numbers 1, 5, 8, 
9, 10, 14, 15, 18, 19 and 20. 


MTB > set c1
DATA > 1:20
DATA > end
MTB > sample 10 c1 c2
MTB > print c2

Data Display

C2
   18    1   20   19    9   10    8   15   14    5

Figure 9.6.1
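The same kind of random allocation can be sketched in Python. This is only an illustration alongside the Minitab session: the function name `choose_treatment_group` and the seed value are assumptions made for the example, and a different seed (or none) will pick a different set of ten rats.

```python
import random


def choose_treatment_group(n_subjects=20, n_treated=10, seed=None):
    """Randomly pick which numbered subjects receive the treatment."""
    rng = random.Random(seed)  # private generator, so a seed makes the run repeatable
    subjects = list(range(1, n_subjects + 1))  # rats numbered 1 through n_subjects
    treated = sorted(rng.sample(subjects, n_treated))  # sampling without replacement
    control = [s for s in subjects if s not in treated]
    return treated, control


if __name__ == "__main__":
    treated, control = choose_treatment_group(seed=2012)
    print("Testosterone group:", treated)
    print("Control group:     ", control)
```

Because the selection is left entirely to the generator, the experimenter's reluctance to grab an aggressive-looking rat has no way to influence which animals end up in which group.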


There is a moral here. Designing, carrying out, and analyzing an experiment is 
an exercise that draws on a variety of scientific, computational, and statistical skills, 
some of which may be quite sophisticated. No matter how well those complex issues 
are attended to, though, the enterprise will fail if the simplest and most basic aspects 
of the experiment—such as assigning subjects to treatments—are not carefully 
scrutinized and properly done. The Devil, as the saying goes, is in the details. 


Appendix 9.A.1 A Derivation of the Two-Sample t Test (A Proof of Theorem 9.2.2) 


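To begin, we note that both the restricted and unrestricted parameter spaces, $\omega$ and
$\Omega$, are three dimensional:

$$\omega = \{(\mu_X, \mu_Y, \sigma^2): -\infty < \mu_X = \mu_Y < \infty,\ 0 < \sigma^2 < \infty\}$$

and

$$\Omega = \{(\mu_X, \mu_Y, \sigma^2): -\infty < \mu_X < \infty,\ -\infty < \mu_Y < \infty,\ 0 < \sigma^2 < \infty\}$$

Since the X’s and Y’s are independent (and normal),

$$L(\omega) = \prod_{i=1}^{n} f_X(x_i) \prod_{j=1}^{m} f_Y(y_j) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n+m} \exp\left\{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^{n}(x_i - \mu)^2 + \sum_{j=1}^{m}(y_j - \mu)^2\right]\right\} \qquad (9.A.1.1)$$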

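where $\mu = \mu_X = \mu_Y$. If we take $\ln L(\omega)$ and solve $\partial \ln L(\omega)/\partial \mu = 0$ and
$\partial \ln L(\omega)/\partial \sigma^2 = 0$ simultaneously, the solutions will be the restricted maximum
likelihood estimates:

$$\mu_e = \frac{\sum_{i=1}^{n} x_i + \sum_{j=1}^{m} y_j}{n+m} \qquad (9.A.1.2)$$

and

$$\sigma_e^2 = \frac{\sum_{i=1}^{n}(x_i - \mu_e)^2 + \sum_{j=1}^{m}(y_j - \mu_e)^2}{n+m} \qquad (9.A.1.3)$$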

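Substituting Equations 9.A.1.2 and 9.A.1.3 into Equation 9.A.1.1 gives the numerator
of the generalized likelihood ratio, where $\sigma_e^2$ denotes the restricted estimate of
Equation 9.A.1.3:

$$L(\omega_e) = \left(\frac{e^{-1}}{2\pi\sigma_e^2}\right)^{(n+m)/2}$$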


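Similarly, the likelihood function unrestricted by the null hypothesis is

$$L(\Omega) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n+m} \exp\left\{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^{n}(x_i - \mu_X)^2 + \sum_{j=1}^{m}(y_j - \mu_Y)^2\right]\right\} \qquad (9.A.1.4)$$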


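Here, solving

$$\frac{\partial \ln L(\Omega)}{\partial \mu_X} = 0, \qquad \frac{\partial \ln L(\Omega)}{\partial \mu_Y} = 0, \qquad \frac{\partial \ln L(\Omega)}{\partial \sigma^2} = 0$$

gives

$$\mu_{X_e} = \bar{x}, \qquad \mu_{Y_e} = \bar{y}, \qquad \sigma^2_{\Omega_e} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{j=1}^{m}(y_j - \bar{y})^2}{n+m}$$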


If these estimates are substituted into Equation 9.A.1.4, the maximum value for
$L(\Omega)$ simplifies to

$$L(\Omega_e) = \left(\frac{e^{-1}}{2\pi\sigma^2_{\Omega_e}}\right)^{(n+m)/2}$$


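It follows, then, that the generalized likelihood ratio, $\lambda$, is equal to

$$\lambda = \frac{L(\omega_e)}{L(\Omega_e)} = \left(\frac{\sigma^2_{\Omega_e}}{\sigma^2_e}\right)^{(n+m)/2}$$

where $\sigma^2_e$ and $\sigma^2_{\Omega_e}$ are the restricted and unrestricted maximum likelihood estimates
of $\sigma^2$, or, equivalently,

$$\lambda^{2/(n+m)} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{j=1}^{m}(y_j - \bar{y})^2}{\sum_{i=1}^{n}\left[x_i - \left(\dfrac{n\bar{x} + m\bar{y}}{n+m}\right)\right]^2 + \sum_{j=1}^{m}\left[y_j - \left(\dfrac{n\bar{x} + m\bar{y}}{n+m}\right)\right]^2}$$

Using the identities

$$\sum_{i=1}^{n}\left[x_i - \left(\frac{n\bar{x} + m\bar{y}}{n+m}\right)\right]^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{m^2 n}{(n+m)^2}(\bar{x} - \bar{y})^2$$

and

$$\sum_{j=1}^{m}\left[y_j - \left(\frac{n\bar{x} + m\bar{y}}{n+m}\right)\right]^2 = \sum_{j=1}^{m}(y_j - \bar{y})^2 + \frac{n^2 m}{(n+m)^2}(\bar{x} - \bar{y})^2$$

we can write $\lambda^{2/(n+m)}$ as

$$\begin{aligned}
\lambda^{2/(n+m)} &= \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{j=1}^{m}(y_j - \bar{y})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{j=1}^{m}(y_j - \bar{y})^2 + \dfrac{nm}{n+m}(\bar{x} - \bar{y})^2} \\
&= \frac{1}{1 + \dfrac{(\bar{x} - \bar{y})^2}{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{j=1}^{m}(y_j - \bar{y})^2\right]\left(\dfrac{1}{n} + \dfrac{1}{m}\right)}} \\
&= \frac{n+m-2}{n+m-2 + \dfrac{(\bar{x} - \bar{y})^2}{s_p^2\left(\dfrac{1}{n} + \dfrac{1}{m}\right)}}
\end{aligned}$$

where $s_p^2$ is the pooled variance:

$$s_p^2 = \frac{1}{n+m-2}\left[\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{j=1}^{m}(y_j - \bar{y})^2\right]$$

Therefore, in terms of the observed t ratio, $t = (\bar{x} - \bar{y})/\left(s_p\sqrt{1/n + 1/m}\right)$, $\lambda^{2/(n+m)}$ simplifies to

$$\lambda^{2/(n+m)} = \frac{n+m-2}{n+m-2+t^2} \qquad (9.A.1.5)$$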


At this point the proof is almost complete. The generalized likelihood ratio
criterion, rejecting $H_0$: $\mu_X = \mu_Y$ when $0 < \lambda \le \lambda^*$, is clearly equivalent to rejecting the
null hypothesis when $0 < \lambda^{2/(n+m)} \le \lambda^{**}$. But both of these, from Equation 9.A.1.5,
are the same as rejecting $H_0$ when $t^2$ is too large. Thus the decision rule in terms
of $t^2$ is

Reject $H_0$: $\mu_X = \mu_Y$ in favor of $H_1$: $\mu_X \ne \mu_Y$ if $t^2 \ge t^{*2}$

Or, phrasing this in still another way, we should reject $H_0$ if either $t \ge t^*$ or $t \le -t^*$,
where

$$P(-t^* < T < t^* \mid H_0: \mu_X = \mu_Y \text{ is true}) = 1 - \alpha$$

By Theorem 9.2.1, though, $T$ has a Student t distribution with $n + m - 2$ df, which
makes $\pm t^* = \pm t_{\alpha/2,\, n+m-2}$, and the theorem is proved.
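Equation 9.A.1.5 can also be checked numerically. The sketch below is an illustration, not part of the proof: it borrows the two samples listed in Appendix 9.A.2 as test data (any two samples would do), and the helper names `sum_sq`, `glr_power`, and `pooled_t` are assumptions made for the example.

```python
import math

# Samples from the Minitab listing in Appendix 9.A.2 (used here only as test data).
x = [0.225, 0.262, 0.217, 0.240, 0.230, 0.229, 0.235, 0.217]
y = [0.209, 0.205, 0.196, 0.210, 0.202, 0.207, 0.224, 0.223, 0.220, 0.201]


def sum_sq(data, center):
    """Sum of squared deviations of the data about a given center."""
    return sum((v - center) ** 2 for v in data)


def glr_power(x, y):
    """lambda^(2/(n+m)), computed directly as a ratio of variance estimates."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    mu_e = (sum(x) + sum(y)) / (n + m)                               # Equation 9.A.1.2
    var_restricted = (sum_sq(x, mu_e) + sum_sq(y, mu_e)) / (n + m)   # Equation 9.A.1.3
    var_unrestricted = (sum_sq(x, xbar) + sum_sq(y, ybar)) / (n + m)
    return var_unrestricted / var_restricted


def pooled_t(x, y):
    """Observed two-sample t ratio based on the pooled variance."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    sp2 = (sum_sq(x, xbar) + sum_sq(y, ybar)) / (n + m - 2)
    return (xbar - ybar) / math.sqrt(sp2 * (1 / n + 1 / m))


n, m = len(x), len(y)
t = pooled_t(x, y)
lhs = glr_power(x, y)
rhs = (n + m - 2) / (n + m - 2 + t ** 2)   # Equation 9.A.1.5
print(f"t = {t:.2f}, direct ratio = {lhs:.6f}, formula = {rhs:.6f}")
```

The two printed ratios agree up to floating-point rounding, which is the content of Equation 9.A.1.5: the likelihood ratio depends on the data only through $t^2$.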


Appendix 9.A.2 Minitab Applications 


Minitab has a simple command, TWOSAMPLE C1 C2, for doing a two-sample
t test on a set of $x_i$’s and $y_i$’s stored in columns C1 and C2, respectively. The same
command automatically constructs a 95% confidence interval for $\mu_X - \mu_Y$.


MTB > set c1
DATA > 0.225 0.262 0.217 0.240 0.230 0.229 0.235 0.217
DATA > end
MTB > set c2
DATA > 0.209 0.205 0.196 0.210 0.202 0.207 0.224 0.223
DATA > 0.220 0.201
DATA > end
MTB > name c1 'X' c2 'Y'
MTB > twosample c1 c2;
SUBC > pooled.

Two-Sample T-Test and CI: X, Y

Two-sample T for X vs Y

     N     Mean    StDev  SE Mean
X    8   0.2319   0.0146   0.0051
Y   10  0.20970  0.00966   0.0031

Difference = mu (X) - mu (Y)
Estimate for difference: 0.02217
95% CI for difference: (0.01005, 0.03430)
T-Test of difference = 0 (vs not =): T-Value = 3.88 P-Value = 0.001 DF = 16
Both use Pooled StDev = 0.0121

Figure 9.A.2.1


Figure 9.A.2.1 shows the syntax for analyzing the Quintus Curtius Snodgrass
data in Table 9.2.1. Notice that a subcommand is included. If we write

MTB > twosample c1 c2

Minitab will assume the two population variances are not equal, and it will perform
the approximate t test described in Theorem 9.2.3. If the intention is to assume that
$\sigma_X^2 = \sigma_Y^2$ (and do the t test as described in Theorem 9.2.1), the proper syntax is

MTB > twosample c1 c2;
SUBC > pooled.


As is typical, Minitab associates the test statistic with a P-value rather than
an “Accept $H_0$” or “Reject $H_0$” conclusion. Here, P = 0.001, which is consistent
with the decision reached in Case Study 9.2.1 to “reject $H_0$ at the $\alpha = 0.01$ level of
significance.” Figure 9.A.2.2 shows the “unpooled” analysis of these same data. The
conclusion is the same, although the P-value has almost tripled, because both the
test statistic and its degrees of freedom have decreased (recall Question 9.2.18).


MTB > set c1
DATA > 0.225 0.262 0.217 0.240 0.230 0.229 0.235 0.217
DATA > end
MTB > set c2
DATA > 0.209 0.205 0.196 0.210 0.202 0.207 0.224 0.223 0.220 0.201
DATA > end
MTB > name c1 'X' c2 'Y'
MTB > twosample c1 c2

Two-Sample T-Test and CI: X, Y

Two-sample T for X vs Y

     N     Mean    StDev  SE Mean
X    8   0.2319   0.0146   0.0051
Y   10  0.20970  0.00966   0.0031

Difference = mu (X) - mu (Y)
Estimate for difference: 0.02217
95% CI for difference: (0.00900, 0.03535)
T-Test of difference = 0 (vs not =): T-Value = 3.70 P-Value = 0.003 DF = 11

Figure 9.A.2.2
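For readers without Minitab, the two test statistics can be reproduced with Python's standard library. This is a rough sketch, not Minitab's own algorithm: the function names are assumptions, the unpooled degrees of freedom use the usual Satterthwaite approximation, and P-values are omitted because the standard library has no Student t distribution.

```python
import math
from statistics import mean, variance

x = [0.225, 0.262, 0.217, 0.240, 0.230, 0.229, 0.235, 0.217]
y = [0.209, 0.205, 0.196, 0.210, 0.202, 0.207, 0.224, 0.223, 0.220, 0.201]


def pooled_t(x, y):
    """Two-sample t statistic and df, assuming equal population variances."""
    n, m = len(x), len(y)
    sp2 = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)
    t = (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / n + 1 / m))
    return t, n + m - 2


def unpooled_t(x, y):
    """Approximate t statistic and Satterthwaite df, variances not assumed equal."""
    n, m = len(x), len(y)
    vx, vy = variance(x) / n, variance(y) / m   # squared standard errors of the means
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df


t_pooled, df_pooled = pooled_t(x, y)
t_unpooled, df_unpooled = unpooled_t(x, y)
print(f"pooled:   t = {t_pooled:.2f}, df = {df_pooled}")
print(f"unpooled: t = {t_unpooled:.2f}, df = {df_unpooled:.2f}")
```

The unpooled degrees of freedom come out a little under 12; the DF = 11 printed in Figure 9.A.2.2 is consistent with that value being truncated to an integer, although that is an inference from the output and not something the text states.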


Testing $H_0$: $\mu_X = \mu_Y$ Using Minitab Windows

1. Enter the two samples in C1 and C2, respectively.
2. Click on STAT, then on BASIC STATISTICS, then on 2-SAMPLE t.
3. Click on SAMPLES IN DIFFERENT COLUMNS, and type C1 in
   FIRST box and C2 in SECOND box.
4. Click on ASSUME EQUAL VARIANCES (if a pooled t test is
   desired).
5. Click on OPTIONS.
6. Enter value for $100(1 - \alpha)$ in CONFIDENCE LEVEL box.
7. Click on NOT EQUAL; then click on whichever $H_1$ is desired.
8. Click on OK; click on remaining OK.


Called by some the founder of twentieth-century statistics, Pearson received his
university education at Cambridge, concentrating on physics, philosophy, and law. 
He was called to the bar in 1881 but never practiced. In 1911 Pearson resigned his 
chair of applied mathematics and mechanics at University College, London, and 
became the first Galton Professor of Eugenics, as was Galton’s wish. Together with 
Weldon, Pearson founded the prestigious journal Biometrika and served as its 
principal editor from 1901 until his death. 

—Karl Pearson (1857-1936) 


10.1 Introduction 


The give-and-take between the mathematics of probability and the empiricism of 
statistics should be, by now, a comfortably familiar theme. Time and time again we 
have seen repeated measurements, no matter their source, exhibiting a regularity of 
pattern that can be well approximated by one or more of the handful of probability 
functions introduced in Chapter 4. Until now, all the inferences resulting from this 
interfacing have been parameter specific, a fact to which the many hypothesis tests 
about means, variances, and binomial proportions paraded forth in Chapters 6, 7, 
and 9 bear ample testimony. Still, there are other situations where the basic form
of $p_X(k)$ or $f_Y(y)$, rather than the value of its parameters, is the most important
question at issue. These situations are the focus of Chapter 10.

A geneticist, for example, might want to know whether the inheritance of a certain
set of traits follows the same set of ratios as those prescribed by Mendelian
theory. The objective of a psychologist, on the other hand, might be to confirm
or refute a newly proposed model for cognitive serial learning. Probably the most
habitual users of inference procedures directed at the entire pdf, though, are
statisticians themselves: As a prelude to doing any sort of hypothesis test or confidence
interval, an attempt should be made, sample size permitting, to verify that the data
are, indeed, representative of whatever distribution that procedure presumes.
Usually, this will mean testing to see whether a set of $y_i$’s might conceivably represent a
normal distribution.

In general, any procedure that seeks to determine whether a set of data could 
reasonably have originated from some given probability distribution, or class of 
probability distributions, is called a goodness-of-fit test. The principle behind the 
particular goodness-of-fit test we will look at is very straightforward: First the 
observed data are grouped, more or less arbitrarily, into k classes; then each class’s 
“expected” occupancy is calculated on the basis of the presumed model. If it should 
happen that the set of observed and expected frequencies shows considerably more 
disagreement than sampling variability would predict, our conclusion will be that 
the supposed $p_X(k)$ or $f_Y(y)$ was incorrect.

In practice, goodness-of-fit tests have several variants, depending on the speci- 
ficity of the null hypothesis. Section 10.3 describes the approach to take when both 
the form of the presumed data model and the values of its parameters are known. 
More typically, we know the form of $p_X(k)$ or $f_Y(y)$, but their parameters need to
be estimated; that case is taken up in Section 10.4.

A somewhat different application of goodness-of-fit testing is the focus of 
Section 10.5. There, the null hypothesis is that two random variables are indepen- 
dent. In more than a few fields of endeavor, tests for independence are among the 
most frequently used of all inference procedures. 


10.2 The Multinomial Distribution 


Their diversity notwithstanding, most goodness-of-fit tests are based on essentially 
the same statistic, one that has an asymptotic chi square distribution. The underlying 
structure of that statistic, though, derives from the multinomial distribution, a direct 
extension of the familiar binomial. In this section we define the multinomial and 
state those of its properties that relate to goodness-of-fit testing. 

Given a series of n independent Bernoulli trials, each with success probability
p, we know that the pdf for X, the total number of successes, is

$$P(X = k) = p_X(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, \ldots, n \qquad (10.2.1)$$

One of the obvious ways to generalize Equation 10.2.1 is to consider situations in
which at each trial, one of t outcomes can occur, rather than just one of two. That is,
we will assume that each trial will result in one of the outcomes $r_1, r_2, \ldots, r_t$, where
$p(r_i) = p_i$, $i = 1, 2, \ldots, t$ (see Figure 10.2.1). It follows, of course, that $\sum_{i=1}^{t} p_i = 1$.


[Figure 10.2.1: n independent trials, each resulting in one of the possible outcomes $r_1, r_2, \ldots, r_t$, with $p_i = P(r_i)$.]


In the binomial model, the two possible outcomes are denoted s and f, where
P(s) = p and P(f) = 1 − p. Moreover, the outcomes of the n trials can be nicely
summarized with a single random variable X, where X denotes the number of successes.
In the more general multinomial model, we will need a random variable to count
the number of times that each of the $r_i$’s occurs. To that end, we define


$$X_i = \text{number of times } r_i \text{ occurs}, \quad i = 1, 2, \ldots, t$$

For a given set of n trials, $X_1 = k_1, X_2 = k_2, \ldots, X_t = k_t$ and $\sum_{i=1}^{t} k_i = n$.


Theorem 10.2.1

Let $X_i$ denote the number of times that the outcome $r_i$ occurs, $i = 1, 2, \ldots, t$, in a
series of n independent trials, where $p_i = P(r_i)$. Then the vector $(X_1, X_2, \ldots, X_t)$ has a
multinomial distribution and

$$p_{X_1, X_2, \ldots, X_t}(k_1, k_2, \ldots, k_t) = P(X_1 = k_1, X_2 = k_2, \ldots, X_t = k_t) = \frac{n!}{k_1!\,k_2! \cdots k_t!}\, p_1^{k_1} p_2^{k_2} \cdots p_t^{k_t},$$

$$k_i = 0, 1, \ldots, n; \quad i = 1, 2, \ldots, t; \quad \sum_{i=1}^{t} k_i = n$$


Proof Any particular sequence of $k_1$ $r_1$’s, $k_2$ $r_2$’s, ..., and $k_t$ $r_t$’s has probability
$p_1^{k_1} p_2^{k_2} \cdots p_t^{k_t}$. Moreover, the total number of outcome sequences that will
generate the values $(k_1, k_2, \ldots, k_t)$ is the number of ways to permute n objects, $k_1$ of one
type, $k_2$ of a second type, ..., and $k_t$ of a t-th type. By Theorem 2.6.2 that number is
$n!/(k_1!\,k_2! \cdots k_t!)$, and the statement of the theorem follows.


Depending on the context, the $r_i$’s associated with the n trials in Figure 10.2.1 can
be either single numerical values (or categories) or ranges of numerical values (or
categories). Example 10.2.1 illustrates the first type; Example 10.2.2, the second. The
only requirements imposed on the $r_i$’s are (1) they must span all of the outcomes
possible at a given trial and (2) they must be mutually exclusive.


Example 10.2.1

Suppose a loaded die is tossed twelve times, where

$$p_i = P(\text{Face } i \text{ appears}) = ci, \quad i = 1, 2, \ldots, 6$$

What is the probability that each face will appear exactly twice?
Note that

$$\sum_{i=1}^{6} p_i = 1 = \sum_{i=1}^{6} ci = c \cdot \frac{6(6+1)}{2}$$

which implies that $c = \frac{1}{21}$ (and $p_i = i/21$). In the terminology of Theorem 10.2.1, the
possible outcomes at each trial are the t = 6 faces, 1 ($= r_1$) through 6 ($= r_6$), and $X_i$
is the number of times face i occurs, $i = 1, 2, \ldots, 6$.

The question is asking for the probability of the vector

$$(X_1, X_2, X_3, X_4, X_5, X_6) = (2, 2, 2, 2, 2, 2)$$

According to Theorem 10.2.1,

$$P(X_1 = 2, X_2 = 2, \ldots, X_6 = 2) = \frac{12!}{2!\,2!\,2!\,2!\,2!\,2!}\left(\frac{1}{21}\right)^2\left(\frac{2}{21}\right)^2 \cdots \left(\frac{6}{21}\right)^2 = 0.0005$$
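The formula in Theorem 10.2.1 is easy to evaluate by machine. The following is a minimal sketch, assuming exact `Fraction` probabilities; the function name `multinomial_pmf` is hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from math import factorial


def multinomial_pmf(counts, probs):
    """P(X_1 = k_1, ..., X_t = k_t) from Theorem 10.2.1, using exact arithmetic."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    if sum(probs) != 1:
        raise ValueError("probabilities must sum to 1")
    coefficient = factorial(sum(counts))
    for k in counts:
        coefficient //= factorial(k)      # n!/(k_1! k_2! ... k_t!) stays an integer
    probability = Fraction(coefficient)
    for k, p in zip(counts, probs):
        probability *= Fraction(p) ** k
    return probability


# Loaded die of Example 10.2.1: p_i = i/21, each face appearing exactly twice.
die_probs = [Fraction(i, 21) for i in range(1, 7)]
answer = multinomial_pmf([2] * 6, die_probs)
print(float(answer))   # approximately 0.0005
```

Working with fractions avoids rounding error in the twelve-fold product; converting to a float at the end is enough for comparison with the value in the example.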
Example 10.2.2

Five observations are drawn at random from the pdf

$$f_Y(y) = 6y(1 - y), \quad 0 \le y \le 1$$

What is the probability that one of the observations lies in the interval [0, 0.25), none
in the interval [0.25, 0.50), three in the interval [0.50, 0.75), and one in the interval
[0.75, 1.00]?

[Figure 10.2.2: graph of $f_Y(y) = 6y(1 - y)$ (vertical axis: probability density) with the ranges $r_1$ through $r_4$ marked.]

Figure 10.2.2 shows the pdf being sampled, together with the ranges $r_1, r_2, r_3$, and
$r_4$, and the intended disposition of the five data points. The $p_i$’s of Theorem 10.2.1
are now areas. Integrating $f_Y(y)$ from 0 to 0.25, for example, gives $p_1$:

$$p_1 = \int_0^{0.25} 6y(1 - y)\,dy = 3y^2\Big|_0^{0.25} - 2y^3\Big|_0^{0.25} = \frac{5}{32}$$

By symmetry, $p_4 = \frac{5}{32}$. Moreover, since the area under $f_Y(y)$ equals 1,

$$p_2 = p_3 = \frac{1}{2}\left(1 - \frac{10}{32}\right) = \frac{11}{32}$$

Let $X_i$ denote the number of observations that fall into the ith range, i =
1, 2, 3, 4. The probability associated with the multinomial vector (1, 0, 3, 1), then,
is 0.0198:

$$P(X_1 = 1, X_2 = 0, X_3 = 3, X_4 = 1) = \frac{5!}{1!\,0!\,3!\,1!}\left(\frac{5}{32}\right)^1\left(\frac{11}{32}\right)^0\left(\frac{11}{32}\right)^3\left(\frac{5}{32}\right)^1 = 0.0198$$
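The areas and the final probability in Example 10.2.2 can be reproduced exactly. The sketch below assumes the antiderivative $F_Y(y) = 3y^2 - 2y^3$ obtained by integrating $6y(1 - y)$; the variable names are illustrative only.

```python
from fractions import Fraction
from math import factorial, prod


def cdf(y):
    """F_Y(y) = 3y^2 - 2y^3, the antiderivative of 6y(1 - y) on [0, 1]."""
    return 3 * y ** 2 - 2 * y ** 3


# Class boundaries 0, 0.25, 0.50, 0.75, 1 written as exact fractions.
cuts = [Fraction(i, 4) for i in range(5)]
class_probs = [cdf(b) - cdf(a) for a, b in zip(cuts, cuts[1:])]

counts = [1, 0, 3, 1]                      # observations required in r_1, ..., r_4
n = sum(counts)
coefficient = factorial(n) // prod(factorial(k) for k in counts)
probability = coefficient * prod(p ** k for p, k in zip(class_probs, counts))

print(class_probs)           # the four areas p_1, ..., p_4
print(float(probability))    # approximately 0.0198
```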


A Multinomial/Binomial Relationship 


Since the multinomial pdf is conceptually a straightforward generalization of the 
binomial pdf, it should come as no surprise that each X; in a multinomial vector is, 
itself, a binomial random variable. 


Theorem 10.2.2

Suppose the vector $(X_1, X_2, \ldots, X_t)$ is a multinomial random variable with
parameters $n, p_1, p_2, \ldots$, and $p_t$. Then the marginal distribution of $X_i$, $i = 1, 2, \ldots, t$, is the
binomial pdf with parameters n and $p_i$.


Proof To deduce the pdf for $X_i$ we need simply to dichotomize the possible
outcomes at each of the trials into “$r_i$” and “not $r_i$.” Then $X_i$ becomes, in effect, the
number of “successes” in n independent Bernoulli trials, where the probability of
success at any given trial is $p_i$. By Theorem 3.2.1, it follows that $X_i$ is a binomial
random variable with parameters n and $p_i$.


Comment Theorem 10.2.2 gives the pdf for any given $X_i$ in a multinomial vector.
Since that pdf is the binomial, we also know that the mean and variance of each $X_i$
are $E(X_i) = np_i$ and $\text{Var}(X_i) = np_i(1 - p_i)$, respectively.
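Theorem 10.2.2 can be spot-checked by brute force for a small case. The sketch below is an illustration, not a proof: it sums a trinomial pdf over the two unwanted coordinates and compares the result with the binomial pdf. The parameter values n = 5 and (1/2, 1/3, 1/6) are arbitrary choices made for the example.

```python
from fractions import Fraction
from math import comb, factorial


def trinomial_pmf(k1, k2, k3, p1, p2, p3):
    """Joint pdf of (X_1, X_2, X_3) for n = k1 + k2 + k3 trials."""
    n = k1 + k2 + k3
    coefficient = factorial(n) // (factorial(k1) * factorial(k2) * factorial(k3))
    return coefficient * p1 ** k1 * p2 ** k2 * p3 ** k3


def marginal_of_x1(k1, n, p1, p2, p3):
    """Sum the joint pdf over every (k2, k3) with k1 + k2 + k3 = n."""
    return sum(trinomial_pmf(k1, k2, n - k1 - k2, p1, p2, p3)
               for k2 in range(n - k1 + 1))


n = 5
p1, p2, p3 = Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)
for k1 in range(n + 1):
    binomial = comb(n, k1) * p1 ** k1 * (1 - p1) ** (n - k1)
    assert marginal_of_x1(k1, n, p1, p2, p3) == binomial   # marginal is binomial(n, p1)

mean_x1 = sum(k1 * marginal_of_x1(k1, n, p1, p2, p3) for k1 in range(n + 1))
print(mean_x1)   # equals n * p1
```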


Example 10.2.3

A physics professor has just given an exam to fifty students enrolled in a
thermodynamics class. From past experience, she has reason to believe that the scores will be
normally distributed with $\mu = 80.0$ and $\sigma = 5.0$. Students scoring ninety or above will
receive A’s, between eighty and eighty-nine, B’s, and so on. What are the expected
values and variances for the numbers of students receiving each of the five letter
grades?

Let Y denote the score a student earns on the exam, and let $r_1, r_2, r_3, r_4$, and $r_5$
denote the ranges corresponding to the letter grades A, B, C, D, and F, respectively.
Then

$$\begin{aligned}
p_1 &= P(\text{Student earns an A}) \\
&= P(90 \le Y \le 100) \\
&= P\left(\frac{90 - 80}{5} \le \frac{Y - 80}{5} \le \frac{100 - 80}{5}\right) \\
&= P(2.00 \le Z \le 4.00) \\
&= 0.0228
\end{aligned}$$

If $X_1$ is the number of A’s that are earned,

$$E(X_1) = np_1 = 50(0.0228) = 1.14$$

and

$$\text{Var}(X_1) = np_1(1 - p_1) = 50(0.0228)(0.9772) = 1.11$$

Table 10.2.1 lists the means and variances for all the $X_i$’s. Each is an illustration
of the Comment following Theorem 10.2.2.


Table 10.2.1

Score            Grade   $p_i$     $E(X_i)$   $\text{Var}(X_i)$
90 ≤ Y ≤ 100     A       0.0228    1.14       1.11
80 ≤ Y < 90      B       0.4772    23.86      12.47
70 ≤ Y < 80      C       0.4772    23.86      12.47
60 ≤ Y < 70      D       0.0228    1.14       1.11
Y < 60           F       0.0000    0.00       0.00
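The entries of Table 10.2.1 can be regenerated with `statistics.NormalDist` from the Python standard library. This is a sketch under the example's assumptions (scores normal with mean 80 and standard deviation 5, n = 50); the layout of `grade_ranges` is an illustrative choice. Because the code uses unrounded normal areas, its output may differ from the table in the last printed digit.

```python
from statistics import NormalDist

scores = NormalDist(mu=80.0, sigma=5.0)
n = 50
# (grade, lower bound, upper bound); None marks the open-ended lower tail for F.
grade_ranges = [("A", 90, 100), ("B", 80, 90), ("C", 70, 80), ("D", 60, 70), ("F", None, 60)]

rows = []
for grade, low, high in grade_ranges:
    lower_cdf = 0.0 if low is None else scores.cdf(low)
    p = scores.cdf(high) - lower_cdf              # area under the normal curve for the range
    rows.append((grade, p, n * p, n * p * (1 - p)))   # binomial mean and variance

for grade, p, expected, var in rows:
    print(f"{grade}  p = {p:.4f}  E(X) = {expected:.2f}  Var(X) = {var:.2f}")
```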
Questions


10.2.1 The Advanced Placement Program allows high 
school students to enroll in special classes in which a sub- 
ject is studied at the college level. Proficiency is measured 
by a national examination. Universities typically grant 
course credit for a sufficiently strong performance. The 
possible scores are 1, 2, 3, 4, and 5, with 5 being the highest.
The following table gives the probabilities associated
with the scores recently made on the U.S. history test (1):


Score Probability 
1 0.116 
2 0.325 
3 0.236 
4 0.211 
5 0.112 


Suppose six students from a class take the test. What is the 
probability they earn three 5’s, two 4’s, and a 3? 


10.2.2 In Mendel’s classical experiments with peas, he
produced hybrids in such a way that the probabilities of
observing the different phenotypes listed below were 9/16,
3/16, 3/16, and 1/16, respectively. Suppose that four such hybrid
plants were selected at random. What is the probability
that each of the four phenotypes would be represented?


Type Probability 
Round and yellow 9/16 
Round and green 3/16 
Angular and yellow 3/16 
Angular and green 1/16 


10.2.3 In classifying hypertension, three categories are 
used: individuals whose systolic blood pressures are less 
than 140, those with blood pressures between 140 and 
160, and those with blood pressures over 160. For males 
between the ages of eighteen and twenty-four, systolic 
blood pressures are normally distributed with a mean 
equal to 124 and a standard deviation equal to 13.7. 
Suppose a random sample of ten individuals from that 
particular demographic group are examined. What is the 
probability that six of the blood pressures will be in the 
first group, three in the second, and one in the third? 


10.2.4 An army enlistment officer categorizes potential
recruits by IQ into three groups: class I (< 90),
class II (90-110), and class III (> 110). Given that
the IQs in the population from which the recruits are
drawn are normally distributed with $\mu = 100$ and $\sigma = 16$,
calculate the probability that of seven enlistees, two will
belong to class I, four to class II, and one to class III.


10.2.5 A disgruntled Anchorage bush pilot, upset because
his gasoline credit card was cancelled, fires six air-to-surface
missiles at the Alaskan pipeline. If a missile lands
anywhere within twenty yards of the pipeline, major structural
damage will be sustained. Assume that the probability
function reflecting the pilot’s expertise as a bombardier
is the expression

$$f_Y(y) = \begin{cases} \dfrac{60 + y}{3600}, & -60 < y < 0 \\[2mm] \dfrac{60 - y}{3600}, & 0 \le y < 60 \\[2mm] 0, & \text{elsewhere} \end{cases}$$

where y denotes the perpendicular distance (in yards)
from the pipeline to the point of impact. What is the probability
that two of the missiles will land within twenty
yards to the left of the pipeline and four will land within
twenty yards to the right?


10.2.6 Based on his performance so far this season, a 
baseball player has the following probabilities associated 
with each official at-bat: 


Outcome     Probability
Out         .713
Single      .270
Double      .010
Triple      .002
Home run    .005


If he has five official at-bats in tomorrow’s game, what are 
the chances he makes two outs and hits two singles and a 
double? 


10.2.7 Suppose that a random sample of fifty observations
are taken from the pdf

$$f_Y(y) = 3y^2, \quad 0 \le y \le 1$$

Let $X_1$ be the number of observations in the interval [0,
1/4), $X_2$ the number in [1/4, 2/4), $X_3$ the number in [2/4,
3/4), and $X_4$ the number in [3/4, 1].
(a) Write a formula for $f_{X_1, X_2, X_3, X_4}(3, 7, 15, 25)$.
(b) Find Var($X_3$).


10.2.8 Let the vector of random variables $(X_1, X_2, X_3)$
have the trinomial pdf with parameters $n$, $p_1$, $p_2$, and $p_3 =
1 - p_1 - p_2$. That is,

$$P(X_1 = k_1, X_2 = k_2, X_3 = k_3) = \frac{n!}{k_1!\,k_2!\,k_3!}\, p_1^{k_1} p_2^{k_2} p_3^{k_3},$$
$$k_i = 0, 1, \ldots, n; \quad i = 1, 2, 3; \quad k_1 + k_2 + k_3 = n$$

By definition, the moment-generating function for
$(X_1, X_2, X_3)$ is given by

$$M_{X_1, X_2, X_3}(t_1, t_2, t_3) = E\left(e^{t_1 X_1 + t_2 X_2 + t_3 X_3}\right)$$

Show that

$$M_{X_1, X_2, X_3}(t_1, t_2, t_3) = \left(p_1 e^{t_1} + p_2 e^{t_2} + p_3 e^{t_3}\right)^n$$


10.2.9 If $M_{X_1, X_2, X_3}(t_1, t_2, t_3)$ is the moment-generating
function for $(X_1, X_2, X_3)$, then $M_{X_1, X_2, X_3}(t_1, 0, 0)$,
$M_{X_1, X_2, X_3}(0, t_2, 0)$, and $M_{X_1, X_2, X_3}(0, 0, t_3)$ are the moment-generating
functions for the marginal pdfs of $X_1$, $X_2$, and
$X_3$, respectively. Use this fact, together with the result of
Question 10.2.8, to verify the statement of Theorem 10.2.2.


10.2.10 Let $(k_1, k_2, \ldots, k_t)$ be the vector of sample observations
representing a multinomial random variable with
parameters $n, p_1, p_2, \ldots$, and $p_t$. Show that the maximum
likelihood estimate for $p_i$ is $k_i/n$, $i = 1, 2, \ldots, t$.


10.3 Goodness-of-Fit Tests: All Parameters Known 


The simplest version of a goodness-of-fit test arises when an experimenter is able to 
specify completely the probability model from which the sample data are alleged to 
have come. It might be supposed, for example, that a set of $y_i$’s is being generated
by an exponential pdf with parameter equal to 6.3, or by a normal distribution with
$\mu = 500$ and $\sigma = 100$. For continuous pdfs such as those, the hypotheses to be tested
will be written

$$H_0: f_Y(y) = f_o(y)$$

versus

$$H_1: f_Y(y) \ne f_o(y)$$

where $f_Y(y)$ and $f_o(y)$ are the true and presumed pdfs, respectively. For a typical
discrete model, the null hypothesis would be written $H_0$: $p_X(k) = p_o(k)$. It is not
uncommon, though, for discrete random variables to be characterized simply by a
set of probabilities associated with the t $r_i$’s defined in Section 10.2, rather than by
an equation. Then the hypotheses to be tested take the form

$$H_0: p_1 = p_{1_o},\ p_2 = p_{2_o},\ \ldots,\ p_t = p_{t_o}$$

versus

$$H_1: p_i \ne p_{i_o} \text{ for at least one } i$$

The first procedure for testing goodness-of-fit hypotheses was proposed by Karl
Pearson in 1900. Couched in the language of the multinomial, the prototype of
Pearson’s method requires that (1) the n observations be grouped into t classes and
(2) the presumed model be completely specified. Theorem 10.3.1 defines Pearson’s
test statistic and gives the decision rule for choosing between $H_0$ and $H_1$. In effect,
$H_0$ is rejected if there is too much disagreement between the actual values for the
multinomial $X_i$’s and the expected values of those same $X_i$’s.


Theorem 10.3.1

Let $r_1, r_2, \ldots, r_t$ be the set of possible outcomes (or ranges of outcomes) associated
with each of n independent trials, where $P(r_i) = p_i$, $i = 1, 2, \ldots, t$. Let $X_i$ = number of
times $r_i$ occurs, $i = 1, 2, \ldots, t$. Then

a. The random variable

$$D = \sum_{i=1}^{t} \frac{(X_i - np_i)^2}{np_i}$$

has approximately a $\chi^2$ distribution with $t - 1$ degrees of freedom. For the
approximation to be adequate, the t classes should be defined so that $np_i \ge 5$
for all i.

b. Let $k_1, k_2, \ldots, k_t$ be the observed frequencies for the outcomes $r_1, r_2, \ldots, r_t$,
respectively, and let $np_{1_o}, np_{2_o}, \ldots, np_{t_o}$ be the corresponding expected
frequencies based on the null hypothesis. At the $\alpha$ level of significance, $H_0$: $f_Y(y) = f_o(y)$
[or $H_0$: $p_X(k) = p_o(k)$ or $H_0$: $p_1 = p_{1_o}, p_2 = p_{2_o}, \ldots, p_t = p_{t_o}$] is rejected if

$$d = \sum_{i=1}^{t} \frac{(k_i - np_{i_o})^2}{np_{i_o}} \ge \chi^2_{1-\alpha,\,t-1}$$

(where $np_{i_o} \ge 5$ for all i).


Proof A formal proof of part (a) lies beyond the scope of this text, but the
direction it takes can be illustrated for the simple case where t = 2. Under that
scenario,

$$\begin{aligned}
D &= \frac{(X_1 - np_1)^2}{np_1} + \frac{(X_2 - np_2)^2}{np_2} \\
&= \frac{(X_1 - np_1)^2}{np_1} + \frac{[n - X_1 - n(1 - p_1)]^2}{n(1 - p_1)} \\
&= \frac{(X_1 - np_1)^2(1 - p_1) + (-X_1 + np_1)^2 p_1}{np_1(1 - p_1)} \\
&= \frac{(X_1 - np_1)^2}{np_1(1 - p_1)}
\end{aligned}$$

From Theorem 10.2.2, $E(X_1) = np_1$ and $\text{Var}(X_1) = np_1(1 - p_1)$, implying that D can
be written

$$D = \left[\frac{X_1 - E(X_1)}{\sqrt{\text{Var}(X_1)}}\right]^2$$

By Theorem 4.3.1, then, D is the square of a variable that is asymptotically a
standard normal, and the statement of part (a) follows (for t = 2) from Definition 7.3.1.
[Proving the general statement is accomplished by showing that the limit of the
moment-generating function for D, as n goes to $\infty$, is the moment-generating
function for a $\chi^2_{t-1}$ random variable. See (63).]
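The algebra in the t = 2 case can be confirmed with exact arithmetic. The sketch below is only a check of the identity, not of the limiting distribution; the values n = 44 and $p_1 = 3/4$ are arbitrary choices, and the function names are assumptions made for the example.

```python
from fractions import Fraction


def pearson_d_two_classes(x1, n, p1):
    """D for t = 2, written as the sum over both classes."""
    x2, p2 = n - x1, 1 - p1
    return (x1 - n * p1) ** 2 / (n * p1) + (x2 - n * p2) ** 2 / (n * p2)


def squared_z(x1, n, p1):
    """Square of the standardized count (X_1 - np_1)/sqrt(np_1(1 - p_1))."""
    return (x1 - n * p1) ** 2 / (n * p1 * (1 - p1))


n, p1 = 44, Fraction(3, 4)
identity_holds = all(
    pearson_d_two_classes(x1, n, p1) == squared_z(x1, n, p1) for x1 in range(n + 1)
)
print(identity_holds)
```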


Comment Although Pearson formulated his statistic before any general theories of
hypothesis testing had been developed, it can be shown that a decision rule based
on D is asymptotically equivalent to the generalized likelihood ratio test of
$H_0$: $p_1 = p_{1_o}, p_2 = p_{2_o}, \ldots, p_t = p_{t_o}$.


Case Study 10.3.1 


Inhabiting many tropical waters is a small (<1 mm) crustacean, Ceriodaphnia 
cornuta, that occurs in two distinct morphological forms: One has a series of 
“horns” protruding from its exoskeleton, while the other is more rounded (see 
Figure 10.3.1). Are these two variants equally likely to end up as fish food, or 
do their predators have a preference (211)? 


[Figure 10.3.1: Forms of C. cornuta (unhorned and horned).]


A large number of C. cornuta were introduced into a holding tank in a 
three-to-one ratio—three of the unhorned variety were added for every one 
with horns. Also present in the tank was a natural predator of C. cornuta, a small 
(6-cm) fish, Melaniris chagresi. After approximately one hour, long enough for 
the predator to have completed its feeding, the fish was sacrificed and the con- 
tents of its stomach examined. Among the forty-four crustacean casualties, the 
unhorned-to-horned ratio was forty to four. What do these body counts imply? 

Here, the two natural classes for the response variable are “unhorned” and 
“horned,” and under the null hypothesis that morphology has no effect on sur- 
vival, it would follow that the probability of either form’s being eaten should 
be proportional to the numbers of each kind available. If $p_1$ = P(Unhorned
C. cornuta is eaten) and $p_2$ = P(Horned C. cornuta is eaten), the experimenter’s
objective reduces to a test of

$$H_0\colon p_1 = \tfrac{3}{4},\ p_2 = \tfrac{1}{4}$$

versus

$$H_1\colon p_1 \ne \tfrac{3}{4},\ p_2 \ne \tfrac{1}{4}$$

Let $\alpha = 0.05$.


Figure 10.3.2 $\chi^2_1$ distribution (area = 0.05 to the right of 3.841).


Since $t = 2$, the behavior of $D$ will be approximated by a $\chi^2_1$ distribution, for
which the 0.05 critical value is 3.841 (see Figure 10.3.2). Substituting the values
for the $k_i$’s and $np_{i_0}$’s into the test statistic gives a $d$ of 5.93:

$$d = \frac{[40 - 44(3/4)]^2}{44(3/4)} + \frac{[4 - 44(1/4)]^2}{44(1/4)} = 5.93$$


Our conclusion, then, is to reject $H_0$: it would appear that morphology does
have an effect on C. cornuta’s chances of being eaten.
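
The calculation in this case study can be sketched in a few lines of Python. The helper name `pearson_d` is hypothetical, and the critical value is simply the 3.841 quoted in the text rather than something computed here. Carried to three decimals the statistic is about 5.939, in line with the 5.93 reported above.

```python
def pearson_d(observed, null_probs):
    """Pearson's goodness-of-fit statistic d for fully specified null probabilities."""
    n = sum(observed)
    return sum((k - n * p) ** 2 / (n * p) for k, p in zip(observed, null_probs))


observed = [40, 4]             # unhorned, horned C. cornuta found in the stomach
null_probs = [3 / 4, 1 / 4]    # H0: eaten in proportion to availability
CRITICAL_VALUE = 3.841         # chi-square, 1 df, alpha = 0.05 (taken from the text)

d = pearson_d(observed, null_probs)
reject_h0 = d >= CRITICAL_VALUE
print(f"d = {d:.3f}, reject H0: {reject_h0}")
```

Both expected frequencies (33 and 11) are at least 5, so the chi-square approximation is being used within the range the theorem recommends.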


About the Data Rejecting $H_0$ in this case does not actually imply what the analysis
seems to suggest. The calculation of d shows that more of the unhorned cornuta
(= 40) were eaten than the null hypothesis had predicted (= 33), and vice versa for 
the horned cornuta (four eaten as opposed to the eleven predicted). But the pres- 
ence or absence of horns was, in fact, irrelevant! A series of follow-up experiments 
analyzed in much the same way clearly indicated that the reason the unhorned cor- 
nuta were snacked on more often was their enlarged eyespot, which made them 
more visible—and sadly (for them) more edible. 


Case Study 10.3.2 


Once upon a time, when there were no computers (and calculations were actu- 
ally done using pencil and paper!), log tables were used to facilitate lengthy 
multiplications. In the early 1930s, Frank Benford, a physicist, reexamined the 
claim made many years earlier by Simon Newcomb that the first several pages 
in library logarithm books are dirtier than the last several pages (recall Exam- 
ple 3.3.3). Why should students and researchers have more reason to look up 
logarithms beginning with 1 or 2, rather than 8 or 9? Benford began look- 
ing closely at a variety of data sets, including molecular weights of chemicals, 
surface areas of rivers, and baseball statistics. 


Table 10.3.1
Digit, i    $\log_{10}(i + 1) - \log_{10}(i)$
1           0.301
2           0.176
3           0.125
4           0.097
5           0.079
6           0.067
7           0.058
8           0.051
9           0.046


What he confirmed to his surprise was the fact that the first nonzero digits 
in these various numbers are not equally likely to be 1’s, 2’s, ..., and 9’s, contrary 
to what our intuition would almost certainly suggest. For reasons discussed in
(78), the probability that the first nonzero digit is $i$ tends to be

$$p_i = \log_{10}(i + 1) - \log_{10}(i), \quad i = 1, 2, \ldots, 9 \qquad (10.3.1)$$


These latter probabilities are now known as Benford’s law (see Table 10.3.1). 
One particularly intriguing application of Benford’s law occurs in auditing, 
where eagle-eyed examiners are ever on the lookout for budgets whose num- 
bers have been fabricated to cover up falsified records. Bookkeepers are not 
likely to be aware of Equation 10.3.1 and would tend to “make up” entries in 
such a way that each first digit from 1 to 9 would occur roughly the same percentage
of the time. Let $p_i$ denote the probability that the first nonzero digit in a
set of data is $i$, $i = 1, 2, \ldots, 9$. A goodness-of-fit test to identify possible instances
of “creative” accounting would define the null hypothesis to be
$H_0$: $p_1 = p_{1_0}, p_2 = p_{2_0}, \ldots, p_9 = p_{9_0}$, where the Benford law probabilities become the $p_{i_0}$’s.


An example of such a test is summarized in Table 10.3.2. The values in 
Column 2 are a breakdown of the 355 first digits appearing in the 1997-98 
operating budget for the University of West Florida (110). The corresponding 
expected frequencies based on Benford’s law are listed in Column 4, and the 
goodness-of-fit test statistic, d, is the sum of the entries in Column 5: 


$$d = \frac{[111 - 355(0.301)]^2}{355(0.301)} + \cdots + \frac{[20 - 355(0.046)]^2}{355(0.046)} = 2.49$$


Table 10.3.2
Digit   Observed, $k_i$   Benford $p_{i_0}$   Expected ($= 355 \cdot p_{i_0}$)   $(k_i - 355p_{i_0})^2 / 355p_{i_0}$
1       111               0.301               106.9                              0.16
2       60                0.176               62.5                               0.10
3       46                0.125               44.4                               0.06
4       29                0.097               34.4                               0.86
5       26                0.079               28.0                               0.15
6       22                0.067               23.8                               0.13
7       21                0.058               20.6                               0.01
8       20                0.051               18.1                               0.20
9       20                0.046               16.3                               0.82
Total   355               1.000               355.0                              2.49


Here, with $t = 9$ classes, the critical value for the hypothesis test comes from
the chi square distribution with 8 df. If $\alpha$ is set equal to 0.05, $\chi^2_{0.95,8} = 15.507$, so
our conclusion is “fail to reject $H_0$.”
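
As a rough check on Table 10.3.2, the Benford probabilities and the test statistic can be recomputed as sketched below. Two assumptions are made: the probabilities are rounded to three decimals, as in Table 10.3.1, and the observed count for digit 6 is taken to be 22, the value for which the counts total 355 and the row contributes 0.13. The critical value 15.507 is the one quoted in the text.

```python
import math

# Benford probabilities p_i = log10(i + 1) - log10(i), rounded as in Table 10.3.1.
benford = [round(math.log10(i + 1) - math.log10(i), 3) for i in range(1, 10)]

# Observed first-digit counts (digit 6 taken as 22; see the assumption above).
observed = [111, 60, 46, 29, 26, 22, 21, 20, 20]
n = sum(observed)

expected = [n * p for p in benford]
d = sum((k - e) ** 2 / e for k, e in zip(observed, expected))

CRITICAL_VALUE = 15.507   # chi-square, 8 df, alpha = 0.05 (taken from the text)
reject_h0 = d >= CRITICAL_VALUE
print(f"n = {n}, d = {d:.2f}, reject H0: {reject_h0}")
```

Every expected frequency exceeds 5 (the smallest is about 16.3), so no classes need to be pooled.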


About the Data There is no denying that Benford’s law is extremely counter- 
intuitive. On everyone’s credibility scale, it would lie somewhere to the right of 
ridiculous. That said, why Benford’s law holds for so many different phenomena 
has a surprisingly simple explanation. 
Suppose a random variable Y takes on values that range over several orders of
magnitude—say, from 10 to 1,000,000. Also, suppose the pdf of Y when plotted on a 
log base 10 scale tapers off slowly, to the extent that, for example, 


$$P(100 < Y < 1000) = P(1000 < Y < 10{,}000) = P(10{,}000 < Y < 100{,}000)$$

That is,

$$P(2 < \log Y < 3) = P(3 < \log Y < 4) = P(4 < \log Y < 5) \qquad (10.3.2)$$


which implies that log Y has approximately a uniform distribution. 

Now, consider the log cycle for Y values ranging from, say, 1000 up to but not 
including 10,000. Table 10.3.3 shows the log interval associated with each of the pos- 
sible first digits, 1 through 9. In the last column are the widths of each of the nine 
associated log intervals. 


Table 10.3.3

Values of Y          Associated Logs               Width of Log Interval
1000 < Y < 1999+     3.00000 < log Y < 3.30103     0.30103
2000 < Y < 2999+     3.30103 < log Y < 3.47712     0.17609
3000 < Y < 3999+     3.47712 < log Y < 3.60206     0.12494
4000 < Y < 4999+     3.60206 < log Y < 3.69897     0.09691
5000 < Y < 5999+     3.69897 < log Y < 3.77815     0.07918
6000 < Y < 6999+     3.77815 < log Y < 3.84510     0.06695
7000 < Y < 7999+     3.84510 < log Y < 3.90309     0.05799
8000 < Y < 8999+     3.90309 < log Y < 3.95424     0.05115
9000 < Y < 9999+     3.95424 < log Y < 4.00000     0.04576


By the earlier assumption, log Y has approximately a uniform distribution over 
much of the range of Y. It follows that for a given log cycle (in this case, $3 < \log Y < 4$),

$$P(a < \log Y < b) = b - a$$


Therefore, if a value is chosen at random from the interval (1000 < Y < 10,000), the 
probability that its first digit will be 1 is the width of the interval of logs associated 
with numbers in the range 1000 < Y < 2000—that is, 3.30103 — 3.00000 = 0.30103. 
Applying that same argument to each of the possible first digits, 1 through 9, gives 
the entries listed in the third column of Table 10.3.3. 
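
The widths in the third column of Table 10.3.3 can be recomputed directly. This is a minimal sketch: the function name and the `cycle` parameter are hypothetical conveniences, and the second loop simply repeats the calculation for another power of 10.

```python
import math


def log_interval_width(first_digit: int, cycle: int = 3) -> float:
    """Width of the log10 interval for values from d * 10^cycle up to (d + 1) * 10^cycle."""
    low = first_digit * 10 ** cycle
    high = (first_digit + 1) * 10 ** cycle
    return math.log10(high) - math.log10(low)


widths = [log_interval_width(digit) for digit in range(1, 10)]
for digit, width in enumerate(widths, start=1):
    print(digit, round(width, 5))

# The widths do not depend on which power of 10 the values fall in.
for digit in range(1, 10):
    assert math.isclose(log_interval_width(digit, cycle=5), widths[digit - 1], abs_tol=1e-9)
```

The nine widths sum to 1, as they must if they are to serve as first-digit probabilities within a single log cycle.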

The interval widths just described, of course, are the same for every log cycle. 
It follows, then, that if a random sample from fy(y) is taken over its entire range, 
roughly 30% of the y;’s will have a first digit of 1, roughly 18% will have a first digit 
of 2, and so on. The entries in the third column are, in fact, Benford’s law: 


$$P(\text{First digit is } i) = \log_{10}(i + 1) - \log_{10}(i), \quad i = 1, 2, \ldots, 9$$


One question still remains: Are there any frequently encountered probability
functions that satisfy the assumptions imposed earlier on $f_Y(y)$? The answer is “yes.”
There is an entire family of pdfs known as power models that have the extremely
long tail necessary for Benford’s law to be applicable. Perhaps the most familiar
member of that family is the Pareto distribution, where

$$f_Y(y) = a y^{-a-1}, \quad a > 0, \ 1 \le y < \infty$$


Originally developed as a model for wealth allocation among members of a population
(recall Question 5.2.14), Pareto’s distribution has been shown more recently
to describe phenomena as diverse as meteorite size, areas burned by forest fires,
population sizes of human settlements, monetary value of oil reserves, and lengths
of jobs assigned to supercomputers. Figure 10.3.3 shows two examples of Pareto
pdfs.


Figure 10.3.3 Two Pareto pdfs, $f_Y(y) = a y^{-a-1}$, $1 \le y < \infty$ (vertical axis: probability density).


Example 10.3.1

A new statistics software package claims to be able to generate random samples
from any continuous pdf. Asked to produce forty observations representing the pdf
$f_Y(y) = 6y(1 - y)$, $0 \le y \le 1$, it printed out the numbers displayed in Table 10.3.4.
Are these forty $y_i$’s a believable random sample from $f_Y(y)$? Do an appropriate
goodness-of-fit test using the $\alpha = 0.05$ level of significance.


Table 10.3.4

0.18   0.06   0.27   0.58   0.98
0.55   0.24   0.58   0.97   0.36
0.48   0.11   0.59   0.15   0.53
0.29   0.46   0.21   0.39   0.89
0.34   0.09   0.64   0.52   0.64
0.71   0.56   0.48   0.44   0.40
0.80   0.83   0.02   0.10   0.51
0.43   0.14   0.74   0.75   0.22


To apply Theorem 10.3.1 to a continuous pdf requires that the data first be 
reduced to a set of classes. Table 10.3.5 shows one possible grouping. The $p_{i_0}$’s in
Column 3 are the areas under $f_Y(y)$ above each of the five classes. For example,

$$p_{1_0} = \int_0^{0.20} 6y(1 - y)\,dy = 0.104$$
Table 10.3.5
Class                  Observed Frequency, $k_i$   $p_{i_0}$   $40p_{i_0}$
$0 \le y < 0.20$       8                           0.104       4.16
$0.20 \le y < 0.40$    8                           0.248       9.92
$0.40 \le y < 0.60$    14                          0.296       11.84
$0.60 \le y < 0.80$    5                           0.248       9.92
$0.80 \le y < 1.00$    5                           0.104       4.16


Column 4 shows the expected frequencies for each of the classes. Notice that
$40p_{1_0}$ and $40p_{5_0}$ are both less than 5 and fail to satisfy the “$np_i \ge 5$” restriction cited in
part (a) of Theorem 10.3.1. That violation can be easily corrected, though: we need
simply to combine the first two classes and the last two classes (see Table 10.3.6).


Table 10.3.6
Class                  Observed Frequency, $k_i$   $p_{i_0}$   $40p_{i_0}$
$0 \le y < 0.40$       16                          0.352       14.08
$0.40 \le y < 0.60$    14                          0.296       11.84
$0.60 \le y < 1.00$    10                          0.352       14.08


The test statistic, $d$, is calculated from the entries in Table 10.3.6:

$$d = \frac{(16 - 14.08)^2}{14.08} + \frac{(14 - 11.84)^2}{11.84} + \frac{(10 - 14.08)^2}{14.08} = 1.84$$


Since the number of classes ultimately being used is three, the number of degrees
of freedom associated with $d$ is 2, and we should reject the null hypothesis that the
forty $y_i$’s are a random sample from $f_Y(y) = 6y(1 - y)$, $0 \le y \le 1$, if $d \ge \chi^2_{0.95,2}$. But the
latter is 5.991, so, based on these data, there is no compelling reason to doubt the
advertised claim.
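
The steps of Example 10.3.1 (group the data, integrate the pdf over each class, compare observed with expected) can be sketched as follows. The function `cdf` and the variable names are hypothetical. The classes are treated as closed on the left and open on the right, an assumption that reproduces the counts in Table 10.3.6, and the critical value 5.991 is the one quoted in the text.

```python
def cdf(y: float) -> float:
    """F_Y(y) = 3y^2 - 2y^3, the integral of 6t(1 - t) from 0 to y."""
    return 3 * y**2 - 2 * y**3


# The forty observations of Table 10.3.4.
data = [
    0.18, 0.06, 0.27, 0.58, 0.98,
    0.55, 0.24, 0.58, 0.97, 0.36,
    0.48, 0.11, 0.59, 0.15, 0.53,
    0.29, 0.46, 0.21, 0.39, 0.89,
    0.34, 0.09, 0.64, 0.52, 0.64,
    0.71, 0.56, 0.48, 0.44, 0.40,
    0.80, 0.83, 0.02, 0.10, 0.51,
    0.43, 0.14, 0.74, 0.75, 0.22,
]
n = len(data)

# Pooled class boundaries from Table 10.3.6.
edges = [0.0, 0.40, 0.60, 1.00]
classes = list(zip(edges, edges[1:]))

observed = [sum(lo <= y < hi for y in data) for lo, hi in classes]
expected = [n * (cdf(hi) - cdf(lo)) for lo, hi in classes]
d = sum((k - e) ** 2 / e for k, e in zip(observed, expected))

CRITICAL_VALUE = 5.991   # chi-square, 2 df, alpha = 0.05 (taken from the text)
print(observed, [round(e, 2) for e in expected], round(d, 2), d >= CRITICAL_VALUE)
```

Changing `edges` to the five original boundaries gives the frequencies of Table 10.3.5, whose first and last expected counts (4.16) fall below 5; that is why the classes were pooled.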


The Goodness-of-Fit Decision Rule—An Exception 


The fact that the decision rule given in part (b) of Theorem 10.3.1 is one-sided to the 
right seems perfectly reasonable—simple logic tells us that the goodness-of-fit null 
hypothesis should be rejected if $d$ is large, but not if $d$ is small. After all, small values
of $d$ will occur only if the observed frequencies are matching up very well with the
predicted frequencies, and it seems that it would never make sense to reject $H_0$ if
that should happen. Not so. There is one specific scenario in which the appropriate 
goodness-of-fit test is one-sided to the left. 

Human nature being what it is, researchers have been known (shame on them) 
to massage, embellish, and otherwise falsify their data. Moreover, in their overzeal- 
ous efforts to support whatever theory they claim is true, they often make a second 
mistake of fabricating data that are too good—that is, that fit their model too closely. 
How can that be detected? By calculating the goodness-of-fit statistic and seeing if 
it falls below $\chi^2_{\alpha,\,t-1}$, where $\alpha$ would be set equal to, say, 0.05 or 0.01.
Case Study 10.3.3


Gregor Mendel (1822-1884) was an Austrian monk and a scientist ahead of 
his time. In 1866 he wrote “Experiments in Plant Hybridization,” which sum- 
marized his exhaustive studies on the way inherited traits in garden peas are 
passed from generation to generation. It was a landmark piece of work in which 
he correctly deduced the basic laws of genetics without knowing anything about 
genes, chromosomes, or molecular biology. But for reasons not entirely clear, 
no one paid any attention and his findings were virtually ignored for the next 
thirty-five years. 

Early in the twentieth century, Mendel’s work was rediscovered and quickly 
revolutionized the cultivation of plants and the breeding of domestic animals. 
With his posthumous fame, though, came some blistering criticism. No less an 
authority than Ronald A. Fisher voiced the opinion that Mendel’s results in that 
1866 paper were too good to be true—the data had to have been falsified. 

Table 10.3.7 summarizes one of the data sets that attracted Fisher’s atten- 
tion (112). Two traits of garden peas were being studied—their shape (round 
or angular) and their color (yellow or green). If “round” and “yellow” are 
dominant and if the alleles controlling those two traits separate independently, 
then (according to Mendel) dihybrid crosses should produce four possible 
phenotypes, with probabilities 9/16, 3/16, 3/16, and 1/16, respectively. 


Table 10.3.7
Phenotype            Obs. Freq.   Mendel’s Model   Exp. Freq.
(round, yellow)      315          9/16             312.75
(round, green)       108          3/16             104.25
(angular, yellow)    101          3/16             104.25
(angular, green)     32           1/16             34.75


Figure 10.3.4 (vertical axis: Density)


Notice how closely the observed frequencies approximate the expected frequencies.
The goodness-of-fit statistic from Theorem 10.3.1 (with 4 − 1 = 3 df) is
equal to 0.47:

$$d = \frac{(315 - 312.75)^2}{312.75} + \frac{(108 - 104.25)^2}{104.25} + \frac{(101 - 104.25)^2}{104.25} + \frac{(32 - 34.75)^2}{34.75} = 0.47$$

Figure 10.3.4 shows that the value of $d = 0.47$ does look suspiciously small. By
itself, it does not rise to the level of a “smoking gun,” but Mendel’s critics had
similar issues with other portions of his data as well.
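
To put a number on "suspiciously small," the left-tail probability of d can be computed. The sketch below is an illustration, not part of the original analysis: it relies on a standard closed-form expression for the cdf of a chi-square random variable with 3 degrees of freedom, which the text does not give, and the function name is hypothetical.

```python
import math


def chi2_cdf_3df(x: float) -> float:
    """P(chi-square with 3 df <= x), closed form valid for 3 df only."""
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


observed = [315, 108, 101, 32]               # Table 10.3.7
model = [9 / 16, 3 / 16, 3 / 16, 1 / 16]     # Mendel's dihybrid probabilities
n = sum(observed)

expected = [n * p for p in model]
d = sum((k - e) ** 2 / e for k, e in zip(observed, expected))

left_tail = chi2_cdf_3df(d)   # chance of a fit at least this close when H0 is true
print(f"d = {d:.2f}, P(D <= d) = {left_tail:.3f}")
```

The left-tail probability falls between 0.05 and 0.10, which agrees with the text's reading that this one data set, taken alone, is not a "smoking gun."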


About the Data Almost seventy-five years have passed since Fisher raised his con- 
cerns about the legitimacy of Mendel’s data, but there is still no broad consensus on 
whether or not portions of the data were falsified. And if they were, who was respon- 
sible? Mendel, of course, would be the logical suspect, but some would-be cold case 
detectives think the gardener did it! What actually happened back in 1866 may never 
be known, because many of Mendel’s original notes and records have been lost or
destroyed.


Questions

10.3.1 Verify the following identity concerning the statistic
of Theorem 10.3.1. Note that the right-hand side is
more convenient for calculations.

$$\sum_{i=1}^{t} \frac{(X_i - np_i)^2}{np_i} = \sum_{i=1}^{t} \frac{X_i^2}{np_i} - n$$

10.3.2 One hundred unordered samples of size 2 are 
drawn without replacement from an urn containing six red 
chips and four white chips. Test the adequacy of the hyper- 
geometric model if zero whites were obtained 35 times; 
one white, 55 times; and two whites, 10 times. Use the 0.10 
decision rule. 


10.3.3 Consider again the previous question. Suppose, 
however, that we do not know whether the samples had 
been drawn with or without replacement. Test whether 
sampling with replacement is a reasonable model. 


10.3.4 Show that the common belief in the propensity of 
babies to choose an inconvenient hour for birth has a basis 
in observation. A maternity hospital reported that out of 
one year’s total of 2650 births, some 494 occurred between 
midnight and 4 a.m. (168). Use the goodness-of-fit test to 
show that the data are not what we would expect if births 
are assumed to occur uniformly in all time periods. Let 
$\alpha = 0.05$.


10.3.5 Analyze the data in the previous problem using 
the techniques of Section 6.3. What is the relationship 
between the two test statistics? 


10.3.6 A number of reports in the medical literature 
suggest that the season of birth and the incidence of 
schizophrenia may be related, with a higher proportion of 
schizophrenics being born during the early months of the
year. A study (72) following up on this hypothesis looked
at 5139 persons born in England or Wales during the years 
1921-1955 who were admitted to a psychiatric ward with a 
diagnosis of schizophrenia. Of these 5139, 1383 were born 
in the first quarter of the year. Based on census figures in 
the two countries, the expected number of persons, out of 
a random 5139, who would be born in the first quarter is 
1292.1. Do an appropriate $\chi^2$ test with $\alpha = 0.05$.


10.3.7 In a move that shocked candy traditionalists, the 
M&M/Mars Company recently replaced the tan M&M’s 
with blue ones. More than ten million people had voted 
in an election to select the new color. On learning of 
the change, one concerned consumer counted the num- 
ber of each color appearing in three pounds of M&M’s 
(55). His tally, shown in the following table, suggests that 
not all the colors appear equally often—blues, in particu- 
lar, are decidedly less common than browns. According to 
an M&M/Mars spokesperson, there are actually three fre- 
quencies associated with the six colors: 30% of M&M’s 
are brown, yellow and red each account for 20%, and 
orange, blue, and green each occur 10% of the time. Test 
at the $\alpha = 0.05$ level of significance the hypothesis that the
consumer’s data are consistent with the company’s stated 
intentions. 


Color Number 
Brown 455 
Yellow 343 
Red 318 
Orange 152 
Blue 130 
Green 129 


10.3.8 The following table lists World Series lengths for
the fifty years from 1926 to 1975. Test at the 0.10 level
whether these data are compatible with the model that
each World Series game is an independent Bernoulli trial
with $p = P(\text{AL wins}) = P(\text{NL wins}) = \tfrac{1}{2}$.


Number of Games   Number of Years
4                 9
5                 11
6                 8
7                 22


10.3.9 Records kept at an eastern racetrack showed the
following distribution of winners as a function of their
starting-post position. All 144 races were run with a full
field of eight horses.


Starting Post       |  1   2   3   4   5   6   7   8
Number of Winners   | 32  21  19  20  16  11  14  11


Test an appropriate goodness-of-fit hypothesis. Let $\alpha = 0.05$.


10.3.10 It was noted in Question 4.3.24 that the mean ($\mu$)
and standard deviation ($\sigma$) of pregnancy durations are 266
days and 16 days, respectively. Accepting those as the true
parameter values, test whether the additional assumption
that pregnancy durations are normally distributed is
supported by the following list of seventy pregnancy durations
reported by County General Hospital. Let $\alpha = 0.10$
be the level of significance. Use “220 < y < 230,” “230 <
y < 240,” and so on, as the classes.


251 264 234 283 226 244 269 241 276 274 
263 243 254 276 241 232 260 248 284 253 
265 235 259 279 256 256 254 256 250 269 
240 261 263 262 259 230 268 284 259 261 
268 268 264 271 263 259 294 259 263 278 
267 293 247 244 250 266 286 263 274 253 
281 286 266 249 255 233 245 266 265 264 


10.3.11 In the past, defendants convicted of grand theft 
auto served Y years in prison, where the pdf describing 
the variation in Y had the form

$$f_Y(y) = \tfrac{1}{9} y^2, \quad 0 < y < 3$$


Recent judicial reforms, though, may have impacted the 
punishment meted out for this particular crime. A review 
of 50 individuals convicted of grand theft auto five years 
ago showed that 8 served less than one year in jail, 16 
served between one and two years, and 26 served between 
two and three years. Are these data consistent with fy (y)? 
Do an appropriate hypothesis test using the $\alpha = 0.05$ level
of significance. 


10.4 Goodness-of-Fit Tests: Parameters Unknown 


More common than the sort of problems described in Section 10.3 are situations 
where the experimenter has reason to believe that the response variable follows 
some particular family of pdfs—say, the normal or the Poisson—but has little or 
no prior information to suggest what values should be assigned to the model’s 
parameters. In cases such as these, we will carry out the goodness-of-fit test by 
first estimating all unknown parameters, preferably with the method of maximum
likelihood. The appropriate test statistic, denoted $d_1$, is a modified version of
Pearson’s $d$:

$$d_1 = \sum_{i=1}^{t} \frac{(k_i - n\hat{p}_{i_0})^2}{n\hat{p}_{i_0}}$$

Here, the factors $\hat{p}_{1_0}, \hat{p}_{2_0}, \ldots, \hat{p}_{t_0}$ denote the estimated probabilities associated with
the outcomes $r_1, r_2, \ldots, r_t$.
For example, suppose $n = 100$ observations are taken from a distribution
hypothesized to be an exponential pdf, $f_0(y) = \lambda e^{-\lambda y}$, $y \ge 0$, and suppose that
$r_1$ is defined to be the interval from 0 to 1.5. If the numerical value of $\lambda$ is
known, say, $\lambda = 0.4$, then the probability associated with $r_1$ would be denoted $p_{1_0}$,
where

$$p_{1_0} = \int_0^{1.5} 0.4e^{-0.4y}\,dy = 0.45$$
On the other hand, suppose $\lambda$ is not known but $\sum_{i=1}^{100} y_i = 200$. Since the maximum
likelihood estimate for $\lambda$ in this case is

$$\lambda_e = n \Big/ \sum_{i=1}^{100} y_i = \frac{100}{200} = 0.50$$

(recall Question 5.2.3), the estimated null hypothesis exponential model is
$f_0(y) = 0.50e^{-0.50y}$, $y \ge 0$, and the corresponding estimated probability associated with $r_1$ is
denoted $\hat{p}_{1_0}$, where

$$\hat{p}_{1_0} = \int_0^{1.5} f_0(y; \lambda_e)\,dy = \int_0^{1.5} \lambda_e e^{-\lambda_e y}\,dy = \int_0^{1.5} 0.5e^{-0.5y}\,dy = 0.53$$

So, whereas $d$ compares the observed frequencies of the $r_i$’s with their expected
frequencies, $d_1$ compares the observed frequencies of the $r_i$’s with their estimated
expected frequencies.

We pay a price for having to rely on the data to fill in details about the presumed
model: Each estimated parameter reduces by 1 the number of degrees of
freedom associated with the $\chi^2$ distribution approximating the sampling distribution
of $D_1$. And, as we have seen in other hypothesis testing situations, as the number of
degrees of freedom associated with the test statistic decreases, so does the power of
the test.


Theorem 10.4.1

Suppose that a random sample of $n$ observations is taken from $f_Y(y)$ [or $p_X(k)$], a
pdf having $s$ unknown parameters. Let $r_1, r_2, \ldots, r_t$ be a set of mutually exclusive
ranges (or outcomes) associated with each of the $n$ observations. Let $\hat{p}_i$ = estimated
probability of $r_i$, $i = 1, 2, \ldots, t$ (as calculated from $f_Y(y)$ [or $p_X(k)$] after the pdf’s $s$
unknown parameters have been replaced by their maximum likelihood estimates). Let
$X_i$ denote the number of times that $r_i$ occurs, $i = 1, 2, \ldots, t$. Then

a. the random variable

$$D_1 = \sum_{i=1}^{t} \frac{(X_i - n\hat{p}_i)^2}{n\hat{p}_i}$$

has approximately a $\chi^2$ distribution with $t - 1 - s$ degrees of freedom. For the
approximation to be fully adequate, the $r_i$’s should be defined so that $n\hat{p}_i \ge 5$ for
all $i$.

b. to test $H_0$: $f_Y(y) = f_0(y)$ [or $H_0$: $p_X(k) = p_0(k)$] at the $\alpha$ level of significance,
calculate

$$d_1 = \sum_{i=1}^{t} \frac{(k_i - n\hat{p}_{i_0})^2}{n\hat{p}_{i_0}}$$

where $k_1, k_2, \ldots, k_t$ are the observed frequencies of $r_1, r_2, \ldots, r_t$, respectively, and
$n\hat{p}_{1_0}, n\hat{p}_{2_0}, \ldots, n\hat{p}_{t_0}$ are the corresponding estimated expected frequencies based
on the null hypothesis. If

$$d_1 \ge \chi^2_{1-\alpha,\,t-1-s}$$

$H_0$ should be rejected. (The $r_i$’s should be defined so that $n\hat{p}_{i_0} \ge 5$ for all $i$.)
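
The exponential illustration that precedes Theorem 10.4.1 contrasts a class probability computed from a known rate with one computed from an estimated rate. A minimal sketch of both calculations follows; the function and variable names are hypothetical, while the numbers (n = 100, a sample total of 200, and the interval from 0 to 1.5) are those used in the text.

```python
import math


def exp_interval_prob(rate: float, a: float, b: float) -> float:
    """P(a <= Y <= b) for an exponential pdf rate * exp(-rate * y), y >= 0."""
    return math.exp(-rate * a) - math.exp(-rate * b)


n = 100
sample_total = 200.0

# Rate known in advance (lambda = 0.4): an ordinary null-hypothesis probability.
p_known = exp_interval_prob(0.4, 0.0, 1.5)

# Rate unknown: replace it with the maximum likelihood estimate n / sum(y_i).
rate_mle = n / sample_total
p_estimated = exp_interval_prob(rate_mle, 0.0, 1.5)

print(f"known rate: {p_known:.2f}; MLE {rate_mle:.2f} gives {p_estimated:.2f}")
```

Multiplying `p_estimated` by n gives the estimated expected frequency for that class, the quantity that enters the statistic of Theorem 10.4.1.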
Case Study 10.4.1


Despite the fact that batters occasionally go on lengthy hitting streaks (and 
slumps), there is reason to believe that the number of hits a baseball player gets 
in a game behaves much like a binomial random variable. Data demonstrating 
that claim have come from a study (132) of National League box scores from 
Opening Day through mid-July in 1996. Players had exactly four official at-bats 
a total of 4096 times during that period. The resulting distribution of their hits is 
summarized in Table 10.4.1. Are these numbers consistent with the hypothesis 
that the number of hits a player gets in four at-bats is binomially distributed? 


Table 10.4.1

Number of Hits, i   Obs. Freq., $k_i$   Estimated Exp. Freq., $n\hat{p}_{i_0}$
0                   1280                1289.1
1                   1717                1728.0
2                   915                 868.6
3                   167                 194.0
4                   17                  16.3


Here the five possible outcomes associated with each four-at-bat game
would be the number of hits a player makes, so $r_1 = 0, r_2 = 1, \ldots, r_5 = 4$. The
presumption to be tested is that the probabilities of those $r_i$’s are given by the
binomial distribution, that is,

$$P(\text{Player gets } i \text{ hits in four at-bats}) = \binom{4}{i} p^i (1 - p)^{4-i}, \quad i = 0, 1, 2, 3, 4$$

where $p$ = P(Player gets a hit on a given at-bat).

In this case, $p$ qualifies as an unknown parameter and needs to be estimated
before the goodness-of-fit analysis can go any further. Recall from Example 5.1.1
that the maximum likelihood estimate for $p$ is the ratio of the total
number of successes divided by the total number of trials. With successes being
“hits” and trials being “at-bats,” it follows that

$$p_e = \frac{1280(0) + 1717(1) + 915(2) + 167(3) + 17(4)}{4096(4)} = \frac{4116}{16{,}384} = 0.251$$

The precise null hypothesis being tested, then, can be written

$$H_0\colon P(\text{Player gets } i \text{ hits}) = \binom{4}{i} (0.251)^i (0.749)^{4-i}, \quad i = 0, 1, 2, 3, 4$$


The third column in Table 10.4.1 shows the estimated expected frequencies
based on the estimated $H_0$ pdf. For example,

$$
\begin{aligned}
n\hat{p}_{1_0} &= \text{estimated expected frequency for } r_1 \\
 &= \text{estimated number of times players would get 0 hits} \\
 &= 4096 \cdot \binom{4}{0} (0.251)^0 (0.749)^4 \\
 &= 1289.1
\end{aligned}
$$

Corresponding to 1289.1, of course, is the entry in the first row of Column 2
in Table 10.4.1, listing the observed number of times players got zero hits
(= 1280).

If we elect to test the null hypothesis at the $\alpha = 0.05$ level of significance,
then by Theorem 10.4.1 $H_0$ should be rejected if

$$d_1 \ge \chi^2_{0.95,\,5-1-1} = 7.815$$

Here the degrees of freedom associated with the test statistic would be
$t - 1 - s = 5 - 1 - 1 = 3$ because $s = 1$ df is lost as a result of $p$ having been replaced by
its maximum likelihood estimate.

Putting the entries from the last two columns of Table 10.4.1 into the
formula for $d_1$ gives

$$d_1 = \frac{(1280 - 1289.1)^2}{1289.1} + \frac{(1717 - 1728.0)^2}{1728.0} + \frac{(915 - 868.6)^2}{868.6} + \frac{(167 - 194.0)^2}{194.0} + \frac{(17 - 16.3)^2}{16.3} = 6.401$$


Our conclusion, then, is to fail to reject $H_0$; the data summarized in Table 10.4.1
do not rule out the possibility that the numbers of hits players get in four-at-bat 
games follow a binomial distribution. 
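
The whole analysis of this case study (estimate p, build the estimated expected frequencies, compute the statistic, adjust the degrees of freedom) can be sketched as below. Variable names are hypothetical, and the critical value 7.815 is the one quoted in the text. The sketch carries the unrounded estimate of p through the calculation, so its expected frequencies and its d_1 can differ slightly from the rounded entries of Table 10.4.1.

```python
import math

observed = [1280, 1717, 915, 167, 17]   # players with 0, 1, 2, 3, 4 hits
at_bats = 4
n = sum(observed)                        # number of four-at-bat games

# Maximum likelihood estimate of p: total hits divided by total at-bats.
total_hits = sum(i * k for i, k in enumerate(observed))
p_hat = total_hits / (at_bats * n)

# Estimated expected frequencies under the fitted binomial model.
expected = [
    n * math.comb(at_bats, i) * p_hat**i * (1 - p_hat) ** (at_bats - i)
    for i in range(at_bats + 1)
]
d1 = sum((k - e) ** 2 / e for k, e in zip(observed, expected))

estimated_params = 1                              # s = 1, since p was estimated
df = len(observed) - 1 - estimated_params         # t - 1 - s
CRITICAL_VALUE = 7.815                            # chi-square, 3 df, alpha = 0.05 (from the text)
print(f"p_hat = {p_hat:.3f}, d1 = {d1:.2f}, df = {df}, reject H0: {d1 >= CRITICAL_VALUE}")
```

Had p been specified in advance rather than estimated, the same statistic would have been referred to a chi-square distribution with 4 degrees of freedom instead of 3.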


About the Data The fact that the binomial pdf is not ruled out as a model for the 
number of hits a player gets in a game is perhaps a little surprising in light of the 
fact that some of its assumptions are clearly not being satisfied. The parameter p, 
for example, is presumed to be constant over the entire set of trials. That is certainly 
not true for the data in Table 10.4.1. Not only does the “true” value of p obviously 
vary from player to player, it varies from at-bat to at-bat for the same player if differ- 
ent pitchers are used during the course of a game. Also in question is whether each 
at-bat qualifies as a truly independent event. As a game progresses, Major League 
players (hitters and pitchers alike) surely rehash what happened on previous at-bats 
and try to make adjustments accordingly. To borrow a term we used earlier in con- 
nection with hypothesis tests, it would appear that the binomial model is somewhat 
“robust” with respect to departures from its two most basic assumptions. 


Case Study 10.4.2 


The Poisson probability function often models rare events that occur over time, 
which suggests that it may prove useful in describing actuarial phenomena. 
Table 10.4.2 raises one such possibility—listed are the daily numbers of death 
notices for women over the age of eighty that appeared in the London Times 
over a three-year period (74). Is it believable that these fatalities are occurring 
in a pattern consistent with a Poisson pdf? 

Table 10.4.2
Number of Deaths, i   Obs. Freq., $k_i$   Est. Exp. Freq., $n\hat{p}_{i_0}$
0                     162                 126.8
1                     267                 273.5
2                     271                 294.9
3                     185                 212.1
4                     111                 114.3
5                     61                  49.3
6                     27                  17.8
7                     8                   5.5
8                     3                   1.4
9                     1                   0.3
10+                   0                   0.1
Total                 1096                1096


To claim that a Poisson pdf can model these data is to say that

$$P(i \text{ women over the age of eighty die on a given day}) = e^{-\lambda}\lambda^i / i!, \quad i = 0, 1, 2, \ldots$$

where $\lambda$ is the expected number of such fatalities on a given day. Other than
what the data may suggest, there is no obvious numerical value to assign to $\lambda$
at the outset. However, from Chapter 5, we know that the maximum likelihood
estimate for the parameter in a Poisson pdf is the sample average rate at which
the events occurred, that is, the total number of occurrences divided by the
total number of time periods covered. Here, that quotient comes to 2.157:

$$\lambda_e = \frac{\text{total number of fatalities}}{\text{total number of days}} = \frac{0(162) + 1(267) + 2(271) + \cdots + 9(1)}{1096} = 2.157$$


The estimated expected frequencies, then, are calculated by multiplying
1096 times $e^{-2.157}(2.157)^i / i!$, $i = 0, 1, 2, \ldots$. The third column in Table 10.4.2
lists the entire set of $n\hat{p}_{i_0}$’s. [Note: Whenever the model being fitted has an
infinite number of possible outcomes (as is the case with the Poisson), the
last expected frequency is calculated by subtracting the sum of all the others
from $n$. This guarantees that the sum of the observed frequencies is equal to
the sum of the estimated expected frequencies.] Applied to these data, that
proviso implies that

estimated expected frequency for “10+” $= 1096 - 126.8 - 273.5 - \cdots - 0.3 = 0.1$


One final modification needs to be made before the test statistic, $d_1$, can be
calculated. Recall that each estimated expected frequency should be at least 5
in order for the $\chi^2$ approximation to the pdf of $D_1$ to be adequate. The last three
classes in Table 10.4.2, though, all have very small values for $n\hat{p}_{i_0}$ (1.4, 0.3,
and 0.1). To comply with the “$n\hat{p}_{i_0} \ge 5$” requirement, we need to pool the last
four rows into a “7+” category, which would have an observed frequency of
12 ($= 0 + 1 + 3 + 8$) and an estimated expected frequency of 7.3
($= 0.1 + 0.3 + 1.4 + 5.5$) (see Table 10.4.3).


Table 10.4.3

Number of Deaths, i   Obs. Freq., $k_i$   Est. Exp. Freq., $n\hat{p}_{i_0}$
0                     162                 126.8
1                     267                 273.5
2                     271                 294.9
3                     185                 212.1
4                     111                 114.3
5                     61                  49.3
6                     27                  17.8
7+                    12                  7.3
Total                 1096                1096


Based on the observed and estimated expected frequencies for the eight $r_i$’s
identified in Table 10.4.3, the test statistic, $d_1$, equals 25.98:

$$d_1 = \frac{(162 - 126.8)^2}{126.8} + \frac{(267 - 273.5)^2}{273.5} + \cdots + \frac{(12 - 7.3)^2}{7.3} = 25.98$$



With eight classes and one estimated parameter, the number of degrees of
freedom associated with $d_1$ is 6 ($= 8 - 1 - 1$). To test

$$H_0\colon P(i \text{ women over eighty die on a given day}) = e^{-2.157}(2.157)^i / i!, \quad i = 0, 1, 2, \ldots$$

at the $\alpha = 0.05$ level of significance, we should reject $H_0$ if

$$d_1 \ge \chi^2_{0.95,6}$$

But the 95th percentile of the $\chi^2_6$ distribution is 12.592, which lies well to the
left of $d_1$, so our conclusion is to reject $H_0$: there is too much disagreement
between the observed and estimated expected frequencies in Table 10.4.3 to be
consistent with the hypothesis that the data’s underlying probability model is a
Poisson pdf.
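
The Poisson fit can be reproduced along the following lines. This is a sketch under stated assumptions: variable names are hypothetical, the pooled "7+" class follows Table 10.4.3, its expected frequency is obtained by subtraction as the bracketed note in the case study prescribes, and the critical value 12.592 is the one quoted in the text. Because unrounded expected frequencies are used, the statistic comes out near, but not exactly at, the 25.98 obtained from the rounded table entries.

```python
import math

# Daily death-notice counts from Table 10.4.2 (no day had 10 or more).
counts = {0: 162, 1: 267, 2: 271, 3: 185, 4: 111, 5: 61, 6: 27, 7: 8, 8: 3, 9: 1}
n = sum(counts.values())                                   # total number of days

# Maximum likelihood estimate of the Poisson rate: the sample mean.
rate_mle = sum(i * k for i, k in counts.items()) / n

# Classes 0, 1, ..., 6 plus a pooled "7+" class.
observed = [counts[i] for i in range(7)]
observed.append(sum(k for i, k in counts.items() if i >= 7))

expected = [n * math.exp(-rate_mle) * rate_mle**i / math.factorial(i) for i in range(7)]
expected.append(n - sum(expected))                         # last class by subtraction

d1 = sum((k - e) ** 2 / e for k, e in zip(observed, expected))
df = len(observed) - 1 - 1                                 # one estimated parameter
CRITICAL_VALUE = 12.592                                    # chi-square, 6 df, alpha = 0.05 (from the text)
print(f"rate = {rate_mle:.3f}, d1 = {d1:.2f}, df = {df}, reject H0: {d1 >= CRITICAL_VALUE}")
```

The largest single contribution comes from the zero class (162 observed days against roughly 127 expected).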


About the Data A row-by-row comparison of the entries in Table 10.4.3 shows a 
pronounced excess of days having zero fatalities and also an excess of days having 
large numbers of fatalities (five, six, or seven plus). One possible explanation for 
those disparities would be that the Poisson assumption that $\lambda$ remains constant over
the entire time covered is not satisfied. Events such as flu epidemics, for example,
might cause $\lambda$ to vary considerably from month to month and contribute to the data’s
“disconnect” from the Poisson model. 
Case Study 10.4.3


Listed in Table 10.4.4 are the times (in days) that it takes each of the fifty states,
the District of Columbia, and Puerto Rico to process a Social Security disability
claim (185). Can these fifty-two measurements be considered a random sample
from a normal distribution? Test the appropriate hypothesis at the $\alpha = 0.05$
level of significance.


Table 10.4.4

State          Time     State            Time     State           Time
Alabama        67.4     Louisiana        86.0     Oklahoma        104.2
Alaska         81.8     Maine            51.3     Oregon          101.6
Arizona       106.5     Maryland         74.1     Pennsylvania     60.1
Arkansas       53.5     Massachusetts    77.8     Puerto Rico     108.3
California    122.8     Michigan         75.1     Rhode Island     84.8
Colorado       71.6     Minnesota        56.5     S. Carolina      70.8
Connecticut    71.4     Mississippi      63.2     S. Dakota        47.3
Delaware       73.1     Missouri         57.1     Tennessee        72.4
D.C.          100.5     Montana          62.2     Texas            72.5
Florida        63.9     Nebraska         70.2     Utah             81.1
Georgia        74.6     Nevada          113.2     Vermont          92.5
Hawaii        115.8     New Hampshire    76.4     Virginia         46.2
Idaho          47.9     New Jersey      109.6     Washington       76.0
Illinois       68.1     New Mexico       74.1     W. Virginia      78.8
Indiana        55.3     New York         86.2     Wisconsin        66.7
Iowa           61.2     N. Carolina      59.5     Wyoming          45.6
Kansas         78.9     N. Dakota        53.9
Kentucky       61.1     Ohio             69.8


Shown in Column 1 of Table 10.4.5 is an initial breakdown of the range of
$Y$ into nine intervals. Notice that the first and last intervals are open-ended to
reflect the fact that the presumed underlying normal distribution is defined for
the entire real line.


Table 10.4.5

Interval                  Obs. Freq., $k_i$    $\hat{p}_{i_o}$    Est. Exp. Freq.
$y < 50.0$                        4            0.0968             5.03
$50.0 \le y < 60.0$               7            0.1209             6.29
$60.0 \le y < 70.0$              10            0.1797             9.34
$70.0 \le y < 80.0$              16            0.2052            10.67
$80.0 \le y < 90.0$               5            0.1797             9.34
$90.0 \le y < 100.0$              1            0.1209             6.29
$100.0 \le y < 110.0$             6            0.0616             3.20
$110.0 \le y < 120.0$             2            0.0253             1.31
$y \ge 120.0$                     1            0.0099             0.51




In a typical test of normality (and these data are no exception) both
parameters, $\mu$ and $\sigma$, need to be estimated before any expected frequencies
can be calculated. Here, using the formulas for the sample mean and sample
standard deviation given in Chapter 5,

$$\mu_e = \bar{y} = 75.0 \text{ days}$$

and

$$\sigma_e = s = 19.3 \text{ days}$$

The estimated probability, $\hat{p}_{i_o}$, associated with the $i$th interval is calculated by
using $\bar{y}$ and $s$ to define an approximate $Z$ transformation. For example,

$$\hat{p}_{3_o} = P(60.0 \le Y < 70.0) = P\left(\frac{60.0 - 75.0}{19.3} \le Z < \frac{70.0 - 75.0}{19.3}\right) = P(-0.78 \le Z < -0.26) = 0.1797$$

The estimated expected frequencies, then, are the products $52 \cdot \hat{p}_{i_o}$, for $i =
1, 2, \ldots, 9$. For the interval $60.0 \le y < 70.0$,

$$n \cdot \hat{p}_{3_o} = 52(0.1797) = 9.34$$

Notice that the three bottom-most subintervals in Column 4 of Table 10.4.5 
have estimated expected frequencies less than 5, which violates the condition 
imposed in Theorem 10.4.1. Collapsing those three into a single interval yields 
a revised set of data on which the goodness-of-fit statistic can be calculated (see 
Table 10.4.6). 


Table 10.4.6

Interval                  Obs. Freq., $k_i$    $\hat{p}_{i_o}$    Est. Exp. Freq.
$y < 50.0$                        4            0.0968             5.03
$50.0 \le y < 60.0$               7            0.1209             6.29
$60.0 \le y < 70.0$              10            0.1797             9.34
$70.0 \le y < 80.0$              16            0.2052            10.67
$80.0 \le y < 90.0$               5            0.1797             9.34
$90.0 \le y < 100.0$              1            0.1209             6.29
$y \ge 100.0$                     9            0.0968             5.03
                                 52            1                 52.0


According to Theorem 10.4.1, the assumption that $Y$ is a normally
distributed random variable should be rejected at the $\alpha = 0.05$ level of
significance if

$$d_1 \ge \chi^2_{0.95,7-1-2} = \chi^2_{0.95,4} = 9.488$$

since the revised data are grouped into seven classes and two parameters in $f_Y(y)$
have been estimated. But

$$d_1 = \frac{(4 - 5.03)^2}{5.03} + \frac{(7 - 6.29)^2}{6.29} + \cdots + \frac{(9 - 5.03)^2}{5.03} = 12.59$$

so the conclusion is to reject the normality assumption.
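The same computation can be sketched in Python from the entries of Table 10.4.6. The variable names are illustrative; both critical values are quoted from the text (9.488 for $\alpha = 0.05$ and 13.277 for $\alpha = 0.01$, each with 4 degrees of freedom) rather than computed. Because the tabled expected frequencies are rounded, the sum may differ from the printed 12.59 in the second decimal place.

```python
# Observed and estimated expected frequencies from Table 10.4.6.
observed = [4, 7, 10, 16, 5, 1, 9]
expected = [5.03, 6.29, 9.34, 10.67, 9.34, 6.29, 5.03]

d1 = sum((k - e) ** 2 / e for k, e in zip(observed, expected))

# Seven classes, two estimated parameters (mu and sigma).
df = len(observed) - 1 - 2

reject_at_05 = d1 >= 9.488    # chi-square 0.95 quantile, 4 df (from the text)
reject_at_01 = d1 >= 13.277   # chi-square 0.99 quantile, 4 df (from the text)

print(round(d1, 2), df, reject_at_05, reject_at_01)
```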


About the Data These data raise two obvious questions: (1) What effect does the
conclusion of the goodness-of-fit test have on the legitimacy of other analyses that
might be done, for example, the construction of a confidence interval for $\mu$? and
(2) What might account for the distribution of processing times not being normal?

The answer to the first question is easy: none. It is true that the derivation of
the formula for, say, a confidence interval for $\mu$ assumes that the data are normally
distributed (recall Theorem 7.4.1). In this case, though, mitigating circumstances
make that assumption not so critical. The sample size is large ($n = 52$); the degree of
nonnormality is not egregious (had $\alpha$ been set equal to 0.01, $H_0$ would not have been
rejected); and, as the discussion on pp. 406-410 pointed out, procedures involving
the Student $t$ distribution are very robust with respect to departures from normality.

The second question is more problematic. The second column in Table 10.4.5
shows that the data are clearly skewed to the right, and there is even a suggestion
that the fifty-two observations might represent a mixture of two distributions, each
having a different mean. The nine states representing the highest processing times
appear to have nothing in common in terms of size, location, or demographics. So
why is the right-hand tail of the distribution so different from the left-hand tail?
Perhaps the states with the longest waiting times have smaller staffs (relative to their
workloads) or they use less up-to-date equipment or follow different procedures.
Another possibility, and one that can always be a factor when data are coming from
different sources, is that not every state is defining or measuring “processing time”
in the same way. From a public policy standpoint, researching the second question
is obviously more important than simply doing a goodness-of-fit test to answer the
first.


Questions


10.4.1 A public policy polling group is investigating
whether people living in the same household tend to make
independent political choices. They select two hundred
homes where exactly three voters live. The residents are
asked separately for their opinion (“yes” or “no”) on a
city charter amendment. If their opinions are formed
independently, the number saying “yes” should be binomially
distributed. Do an appropriate goodness-of-fit test on the
data below. Let $\alpha = 0.05$.


No. Saying “yes” Frequency 


0 
1 
2 73 
3 


10.4.2 From 1837 to 1932, the U.S. Supreme Court had
forty-eight vacancies. The table in the next column shows
the number of years in which exactly $k$ of the vacancies
occurred (185). At the $\alpha = 0.01$ level of significance,
test the hypothesis that these data can be described by a
Poisson pdf.


Number of Vacancies    Number of Years
0                      59
1                      27
2                       9
3                       1
4+                      0


10.4.3 As a way of studying the spread of a plant disease
known as creeping rot, a field of cabbage plants was
divided into 270 quadrats, each quadrat containing the
same number of plants. The following table lists the numbers
of plants per quadrat showing signs of creeping rot
infestation.


Number of Infected
Plants/Quadrat         Number of Quadrats
0                      38
1                      57
2                      68
3                      47
4                      23
5                       9
6                      10
7                       7
8                       3
9                       4
10                      2
11                      1
12                      1
13+                     0


Can the number of plants infected with creeping rot per
quadrat be described by a Poisson pdf? Let $\alpha = 0.05$.
What might be a physical reason for the Poisson not being
appropriate in this situation? Which assumption of the
Poisson appears to be violated?


10.4.4 Carry out the details for a goodness-of-fit test on 
the horse kick data of Question 4.2.10. Use the 0.01 level 
of significance. 


10.4.5 In rotogravure, a method of printing by rolling
paper over engraved, chrome-plated cylinders, the printed
paper can be flawed by undesirable lines called bands.
Bands occur when grooves form on the cylinder’s surface.
When this happens, the presses must be stopped,
and the cylinders repolished or replated. The following
table gives the number of workdays a printing firm
experienced between successive banding shutdowns (39).
Fit these data with an exponential model and perform
the appropriate goodness-of-fit test at the 0.05 level of
significance.


Workdays Between Shutdowns    Number Observed
0-1                           130
1-2                            41
2-3                            25
3-4                             8
4-5                             2
5-6                             3
6-7                             1
7-8                             1


10.4.6 Do a goodness-of-fit test for normality on the 
SAT data in Table 3.13.1. Take the sample mean 
and sample standard deviation to be 949.4 and 68.4, 
respectively. 


10.4.7 A sociologist is studying various aspects of the personal
lives of preeminent nineteenth-century scholars. A
total of 120 subjects in her sample had families consisting
of two children. The distribution of the number of boys in
those families is summarized in the following table. Can it
be concluded that the number of boys in two-child families
of preeminent scholars is binomially distributed? Let
$\alpha = 0.05$.


Number of boys         0     1     2
Number of families    24    64    32


10.4.8 In theory, Monte Carlo studies rely on computers
to generate large sets of random numbers. Particularly
important are random variables representing the
uniform pdf defined over the unit interval, $f_Y(y) = 1$, $0 \le
y \le 1$. In practice, though, computers typically generate
pseudorandom numbers, the latter being values produced
systematically by sophisticated algorithms that presumably
mimic “true” random variables. Below are one hundred
pseudorandom numbers from a uniform pdf. Set up
and test the appropriate goodness-of-fit hypothesis. Let
$\alpha = 0.05$.


.216 .673 .130 .587 .044 .501 .958 .415 .872 .329
.786 .243 .700 .157 .614 .071 .528 .985 .442 .899
.356 .813 .270 .727 .184 .641 .098 .555 .012 .469
.926 .383 .840 .297 .754 .211 .668 .125 .582 .039
.496 .953 .410 .867 .324 .781 .238 .695 .152 .609
.066 .523 .980 .437 .894 .351 .808 .265 .722 .179
.636 .093 .550 .007 .464 .921 .378 .835 .292 .749
.206 .663 .120 .577 .034 .491 .948 .405 .862 .319
.776 .233 .690 .147 .604 .061 .518 .975 .432 .889
.346 .803 .260 .717 .174 .631 .088 .545 .002 .459


10.4.9 Because it satisfies all the assumptions implicit
in the Poisson model, radioactive decay should be
described by a probability function of the form $p_X(k) =
e^{-\lambda}\lambda^k/k!$, $k = 0, 1, 2, \ldots$, where the random variable $X$
denotes the number of particles emitted (or counted)
during a given time interval. Does that hold true
for the Rutherford and Geiger data given in Case
Study 4.2.2? Set up and carry out an appropriate
analysis.


10.4.10 Carry out the details to test whether the suf- 
frage data described in Question 4.2.13 follow a Poisson 
model. 


10.4.11 Is the following set of data likely to have
come from the geometric pdf, $p_X(k) = (1 - p)^{k-1}p$,
$k = 1, 2, \ldots$?


De Be OV, 22 2 5 AN PDO BY 3 
SA De Ae 7? De De 8: A. OF 
2 6 2 BS 3 3. 2 5 
422 3 63 64 9 3 
3.7 5 13 4 3 4 6 2

10.4.12 To raise money for a new rectory, the members of
a church hold a raffle. A total of $n$ tickets are sold (numbered
1 through $n$), out of which a total of fifty winners
are to be drawn presumably at random. The following are
the fifty lucky numbers. Set up a goodness-of-fit test that
focuses on the randomness of the draw. Use the 0.05 level
of significance.

108  110   21    6   44
 89   68   50   13   63
 84   64   69   92   12
 46   78  113  104  105
  9  115   58    2   20
 19   96   28   72   81
 32   75    3   49   86


10.5 Contingency Tables 


Hypothesis tests, as we have seen, take several fundamentally different forms. Those
covered in Chapters 6, 7, and 9 focus on parameters of pdfs: the one-sample,
two-sided $t$ test, for example, reduces to a choice between $H_0\colon \mu = \mu_0$ and $H_1\colon \mu \ne \mu_0$.
Earlier in this chapter, the pdf itself was the issue, and the goodness-of-fit tests in
Sections 10.3 and 10.4 dealt with null hypotheses of the form $H_0\colon f_Y(y) = f_o(y)$.

A third (and final) category of hypothesis tests remains. These apply to situations
where the independence of two random variables is being questioned.
Examples are commonplace. Are the incidence rates of cancer related to mental
health? Do a politician’s approval ratings depend on the gender of the respondents?
Are trends in juvenile delinquency linked to the increasing violence in video
games? In this section, we will modify the goodness-of-fit statistic $D_1$ in such a way
that it can distinguish between events that are independent and events that are
dependent.


Testing for Independence: A Special Case 


A simple example is the best way to motivate the changes that need to be made
to the structure of $D_1$ to make it capable of testing for independence. The key is
Definition 2.5.1.

Suppose $A$ is some trait (or random variable) that has two mutually exclusive
categories, $A_1$ and $A_2$, and suppose that $B$ is a second trait (or random variable) that
also has two mutually exclusive categories, $B_1$ and $B_2$. To say that $A$ is independent
of $B$ is to say that the likelihoods of $A_1$ or $A_2$ occurring are not influenced by $B_1$ or
$B_2$. More specifically, four separate conditional probability equations must hold if $A$
and $B$ are to be independent:

$$\begin{aligned}
P(A_1 \mid B_1) &= P(A_1) & P(A_1 \mid B_2) &= P(A_1) \\
P(A_2 \mid B_1) &= P(A_2) & P(A_2 \mid B_2) &= P(A_2)
\end{aligned} \tag{10.5.1}$$

By Definition 2.4.1, $P(A_i \mid B_j) = \dfrac{P(A_i \cap B_j)}{P(B_j)}$ for all $i$ and $j$, so the conditions
specified in Equation 10.5.1 are equivalent to

$$\begin{aligned}
P(A_1 \cap B_1) &= P(A_1)P(B_1) & P(A_1 \cap B_2) &= P(A_1)P(B_2) \\
P(A_2 \cap B_1) &= P(A_2)P(B_1) & P(A_2 \cap B_2) &= P(A_2)P(B_2)
\end{aligned} \tag{10.5.2}$$

Now, suppose a random sample of $n$ observations is taken, and $n_{ij}$ is defined to be
the number of observations belonging to $A_i$ and $B_j$ (so $n = n_{11} + n_{12} + n_{21} + n_{22}$). If
we imagine the two categories of $A$ and the two categories of $B$ defining a matrix
with two rows and two columns, the four observed frequencies can be displayed in
the contingency table pictured in Table 10.5.1.


Table 10.5.1
                            Trait B
                      $B_1$       $B_2$       Row Totals
Trait A   $A_1$       $n_{11}$    $n_{12}$    $R_1$
          $A_2$       $n_{21}$    $n_{22}$    $R_2$
Column totals:        $C_1$       $C_2$       $n$


If $A$ and $B$ are independent, the probability statements in Equation 10.5.2 would
be true, and (by virtue of Theorem 10.2.2), the expected frequencies for the four
combinations of $A_i$ and $B_j$ would be the entries shown in Table 10.5.2.


Table 10.5.2
                            Trait B
                      $B_1$               $B_2$               Row Totals
Trait A   $A_1$       $nP(A_1)P(B_1)$     $nP(A_1)P(B_2)$     $R_1$
          $A_2$       $nP(A_2)P(B_1)$     $nP(A_2)P(B_2)$     $R_2$
Column totals:        $C_1$               $C_2$               $n$

Although $P(A_1)$, $P(A_2)$, $P(B_1)$, and $P(B_2)$ are unknown, they all have obvious
estimates, namely, the sample proportion of the time that each occurs. That is,

$$\begin{aligned}
\hat{P}(A_1) &= \frac{R_1}{n} & \hat{P}(B_1) &= \frac{C_1}{n} \\
\hat{P}(A_2) &= \frac{R_2}{n} & \hat{P}(B_2) &= \frac{C_2}{n}
\end{aligned} \tag{10.5.3}$$

Table 10.5.3, then, shows the estimated expected frequencies (corresponding to $n_{11}$,
$n_{12}$, $n_{21}$, and $n_{22}$) based on the assumption that $A$ and $B$ are independent.


Table 10.5.3
                            Trait B
                      $B_1$           $B_2$
Trait A   $A_1$       $R_1C_1/n$      $R_1C_2/n$
          $A_2$       $R_2C_1/n$      $R_2C_2/n$

If traits $A$ and $B$ are independent, the observed frequencies in Table 10.5.1
should agree fairly well with the estimated expected frequencies in Table 10.5.3
because the latter were calculated under the presumption that $A$ and $B$ are independent.
The analog of the test statistic $d_1$, then, would be the sum $d_2$, where

$$d_2 = \frac{\left(n_{11} - \frac{R_1C_1}{n}\right)^2}{\frac{R_1C_1}{n}} + \frac{\left(n_{12} - \frac{R_1C_2}{n}\right)^2}{\frac{R_1C_2}{n}} + \frac{\left(n_{21} - \frac{R_2C_1}{n}\right)^2}{\frac{R_2C_1}{n}} + \frac{\left(n_{22} - \frac{R_2C_2}{n}\right)^2}{\frac{R_2C_2}{n}}$$

In the event that $d_2$ is “large,” meaning that one or more of the observed frequencies
is substantially different from the corresponding estimated expected frequency, $H_0$:
$A$ and $B$ are independent should be rejected. (In this simple case where both $A$ and
$B$ have only two categories, $D_2$ has approximately a $\chi^2_1$ pdf when $H_0$ is true, so if $\alpha$
were set at 0.05, $H_0$ would be rejected if $d_2 \ge \chi^2_{0.95,1} = 3.841$.)
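A minimal Python sketch of the 2 × 2 calculation follows. The function name `two_by_two_statistic` and the four counts passed to it are hypothetical, chosen only to illustrate the formula; the critical value 3.841 is the one quoted in the text.

```python
def two_by_two_statistic(n11, n12, n21, n22):
    """Compute d2 for a 2 x 2 table from the four observed frequencies."""
    n = n11 + n12 + n21 + n22
    r1, r2 = n11 + n12, n21 + n22        # row totals R1, R2
    c1, c2 = n11 + n21, n12 + n22        # column totals C1, C2
    cells = [
        (n11, r1 * c1 / n),              # (observed, estimated expected)
        (n12, r1 * c2 / n),
        (n21, r2 * c1 / n),
        (n22, r2 * c2 / n),
    ]
    return sum((obs - est) ** 2 / est for obs, est in cells)

# Hypothetical counts, not taken from the text.
d2 = two_by_two_statistic(30, 20, 10, 40)
reject_h0 = d2 >= 3.841                  # chi-square 0.95 quantile, 1 df
print(round(d2, 3), reject_h0)
```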


Testing for Independence: The General Case 


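Suppose $n$ observations are taken on a sample space $S$ partitioned by the set of
events $A_1, A_2, \ldots, A_r$ and also partitioned by the set of events $B_1, B_2, \ldots, B_c$. That is,

$$A_i \cap A_j = \emptyset \text{ for all } i \ne j \quad \text{and} \quad \bigcup_{i=1}^{r} A_i = S$$

and

$$B_i \cap B_j = \emptyset \text{ for all } i \ne j \quad \text{and} \quad \bigcup_{j=1}^{c} B_j = S$$

Let the random variables $X_{ij}$, $i = 1, 2, \ldots, r$, $j = 1, 2, \ldots, c$, denote the number of
observations that belong to $A_i \cap B_j$. Our objective is to test whether the $A_i$’s are
independent of the $B_j$’s.

Table 10.5.4 shows the two sets of events defining the rows and columns of an
$r \times c$ matrix; the $k_{ij}$’s that appear in the body of the table are the observed values of
the $X_{ij}$’s (recall Table 10.5.1).


Table 10.5.4

                 $B_1$       $B_2$       $\cdots$    $B_c$       Row Totals
$A_1$            $k_{11}$    $k_{12}$    $\cdots$    $k_{1c}$    $R_1$
$A_2$            $k_{21}$    $k_{22}$    $\cdots$    $k_{2c}$    $R_2$
$\vdots$         $\vdots$    $\vdots$                $\vdots$    $\vdots$
$A_r$            $k_{r1}$    $k_{r2}$    $\cdots$    $k_{rc}$    $R_r$
Column totals    $C_1$       $C_2$       $\cdots$    $C_c$       $n$


[Note: In the terminology of Section 10.2, the $X_{ij}$’s are a set of $rc$ multinomial random
variables. Moreover, each individual $X_{ij}$ is a binomial random variable with
parameters $n$ and $p_{ij}$, where $p_{ij} = P(A_i \cap B_j)$.]

Let $p_i = P(A_i)$, $i = 1, 2, \ldots, r$, and let $q_j = P(B_j)$, $j = 1, 2, \ldots, c$, so

$$\sum_{i=1}^{r} p_i = 1 = \sum_{j=1}^{c} q_j$$

Invariably, the $p_i$’s and $q_j$’s will be unknown, but their maximum likelihood
estimates are simply the corresponding row and column sample proportions:

$$\hat{p}_1 = R_1/n, \quad \hat{p}_2 = R_2/n, \quad \ldots, \quad \hat{p}_r = R_r/n$$
$$\hat{q}_1 = C_1/n, \quad \hat{q}_2 = C_2/n, \quad \ldots, \quad \hat{q}_c = C_c/n$$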


(recall Equation 10.5.3).

If the $A_i$’s and $B_j$’s are independent, then

$$P(A_i \cap B_j) = P(A_i)P(B_j) = p_iq_j$$

and the expected frequency corresponding to $k_{ij}$ would be $np_iq_j$, $i = 1, 2, \ldots, r$;
$j = 1, 2, \ldots, c$ (recall the Comment following Theorem 10.2.2). Also, the estimated
expected frequency for $A_i \cap B_j$ would be

$$n\hat{p}_i\hat{q}_j = n \cdot R_i/n \cdot C_j/n = R_iC_j/n \tag{10.5.4}$$

(recall Table 10.5.3).

So, for each of the $rc$ row-and-column combinations pictured in Table 10.5.4, we
have an observed frequency ($k_{ij}$) and an estimated expected frequency ($R_iC_j/n$)
based on the null hypothesis that the $A_i$’s are independent of the $B_j$’s. The test
statistic that would be analogous to $d_1$, then, would be the double sum $d_2$, where

$$d_2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(k_{ij} - n\hat{p}_i\hat{q}_j)^2}{n\hat{p}_i\hat{q}_j}$$

Large values of $d_2$ would be considered evidence against the independence
assumption.


Theorem 10.5.1

Suppose that $n$ observations are taken on a sample space partitioned by the events
$A_1, A_2, \ldots, A_r$ and also by the events $B_1, B_2, \ldots, B_c$. Let $p_i = P(A_i)$, $q_j = P(B_j)$,
and $p_{ij} = P(A_i \cap B_j)$, $i = 1, 2, \ldots, r$; $j = 1, 2, \ldots, c$. Let $X_{ij}$ denote the number of
observations belonging to the intersection $A_i \cap B_j$. Then

a. the random variable

$$D_2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(X_{ij} - np_{ij})^2}{np_{ij}}$$

has approximately a $\chi^2$ distribution with $rc - 1$ degrees of freedom (provided
$np_{ij} \ge 5$ for all $i$ and $j$).

b. to test $H_0$: the $A_i$’s are independent of the $B_j$’s, calculate the test statistic

$$d_2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(k_{ij} - n\hat{p}_i\hat{q}_j)^2}{n\hat{p}_i\hat{q}_j}$$

where $k_{ij}$ is the number of observations in the sample that belong to $A_i \cap B_j$, $i =
1, 2, \ldots, r$; $j = 1, 2, \ldots, c$, and $\hat{p}_i$ and $\hat{q}_j$ are the maximum likelihood estimates for
$p_i$ and $q_j$, respectively. The null hypothesis should be rejected at the $\alpha$ level of
significance if

$$d_2 \ge \chi^2_{1-\alpha,(r-1)(c-1)}$$

(Analogous to the condition stipulated for all other goodness-of-fit tests, it will be
assumed that $n\hat{p}_i\hat{q}_j \ge 5$ for all $i$ and $j$.)


Comment In general, the number of degrees of freedom associated with a
goodness-of-fit statistic is given by the formula

df = number of classes - 1 - number of estimated parameters

(recall Theorem 10.4.1). For the double sum that defines $d_2$,

number of classes = $rc$

number of estimated parameters = $r - 1 + c - 1$

(because once $r - 1$ of the $p_i$’s are estimated, the one that remains is predetermined
by the fact that $\sum_{i=1}^{r} p_i = 1$; similarly, only $c - 1$ of the $q_j$’s need to be estimated). But

$$rc - 1 - (r - 1) - (c - 1) = (r - 1)(c - 1)$$


Comment The $\chi^2$ distribution with $(r - 1)(c - 1)$ degrees of freedom provides an
adequate approximation to the distribution of $d_2$ only if $n\hat{p}_i\hat{q}_j \ge 5$ for all $i$ and $j$. If
one or more cells in a contingency table have estimated expected frequencies that
are substantially less than 5, the table should be “collapsed” and the rows and/or
columns redefined.
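Both comments lend themselves to a quick numerical check. The sketch below is illustrative only: the function names `contingency_df` and `small_expected_cells` are hypothetical, and the 3 × 2 table of counts is made up to show how cells with small estimated expected frequencies might be flagged before deciding whether to collapse rows or columns.

```python
def contingency_df(r, c):
    """df = number of classes - 1 - number of estimated parameters."""
    number_of_classes = r * c
    estimated_parameters = (r - 1) + (c - 1)
    return number_of_classes - 1 - estimated_parameters

def small_expected_cells(table, threshold=5.0):
    """Return (row, column, estimate) for cells with R_i * C_j / n below threshold."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    flagged = []
    for i, r_i in enumerate(row_totals):
        for j, c_j in enumerate(col_totals):
            estimate = r_i * c_j / n
            if estimate < threshold:
                flagged.append((i, j, estimate))
    return flagged

# The two expressions for the degrees of freedom agree for every table size tried.
assert all(contingency_df(r, c) == (r - 1) * (c - 1)
           for r in range(2, 8) for c in range(2, 8))

# Hypothetical 3 x 2 table of observed counts.
counts = [[12, 3], [9, 2], [20, 4]]
flagged = small_expected_cells(counts)
print(contingency_df(3, 2), [(i, j, round(e, 2)) for i, j, e in flagged])
```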


Case Study 10.5.1 


Gene Siskel and Roger Ebert were popular movie critics for a syndicated 
television show. Viewers of the program were entertained by the frequent 
flare-ups of acerbic disagreement between the two. They were immediately 
recognizable to a large audience of movie goers by their rating system of 
“thumbs up” for good films, “thumbs down” for bad ones, and an occasional 
“sideways” for those in between. 

Table 10.5.5 summarizes their evaluations of 160 movies (2). Do these numbers
suggest that Siskel and Ebert had completely different aesthetics, in which
case their ratings would be independent, or do they demonstrate that the two
shared considerable common ground, despite their many on-the-air verbal jabs?


Table 10.5.5
                              Ebert Ratings
                       Down    Sideways    Up
Siskel     Down         24        8        13
Ratings    Sideways      8       13        11
           Up           10        9        64
Total                   42       30        88


Using Equation 10.5.4, we can calculate the estimated expected number of
times that both reviewers would say “thumbs down” if, in fact, their ratings were
independent:

$$\hat{E}(X_{11}) = \frac{R_1 \cdot C_1}{n} = \frac{(45)(42)}{160} = 11.8$$


Table 10.5.6 displays the entire set of estimated expected frequencies (shown in
parentheses), all calculated the same way.


Table 10.5.6
                                  Ebert Ratings
                       Down         Sideways     Up           Total
Siskel     Down        24 (11.8)     8 (8.4)     13 (24.8)     45
Ratings    Sideways     8 (8.4)     13 (6.0)     11 (17.6)     32
           Up          10 (21.8)     9 (15.6)    64 (45.6)     83
Total                  42           30           88           160

Now, suppose we wish to test

$H_0$: Siskel ratings and Ebert ratings were independent

versus

$H_1$: Siskel ratings and Ebert ratings were dependent

at the $\alpha = 0.01$ level of significance. With $r = 3$ and $c = 3$, the number of degrees
of freedom associated with the test statistic is $(3 - 1)(3 - 1) = 4$, and $H_0$ should
be rejected if

$$d_2 \ge \chi^2_{0.99,4} = 13.277$$

But

$$d_2 = \frac{(24 - 11.8)^2}{11.8} + \frac{(8 - 8.4)^2}{8.4} + \cdots + \frac{(64 - 45.6)^2}{45.6} = 45.37$$

so the evidence is overwhelming that Siskel and Ebert’s judgments were not
independent.
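The full calculation for the Siskel and Ebert ratings can be sketched in Python. The function name `independence_statistic` is illustrative; the estimated expected frequencies are computed as $R_iC_j/n$ without intermediate rounding, so the result may differ slightly in the second decimal place from a hand calculation based on rounded table entries. The critical value 13.277 is quoted from the text.

```python
def independence_statistic(table):
    """Return (d2, df) for an r x c table of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    d2 = 0.0
    for i, row in enumerate(table):
        for j, k_ij in enumerate(row):
            estimate = row_totals[i] * col_totals[j] / n   # R_i * C_j / n
            d2 += (k_ij - estimate) ** 2 / estimate
    df = (len(table) - 1) * (len(table[0]) - 1)
    return d2, df

# Rows: Siskel (Down, Sideways, Up); columns: Ebert (Down, Sideways, Up).
ratings = [
    [24, 8, 13],
    [8, 13, 11],
    [10, 9, 64],
]

d2, df = independence_statistic(ratings)
reject_h0 = d2 >= 13.277   # chi-square 0.99 quantile, 4 df (from the text)
print(round(d2, 2), df, reject_h0)
```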


“Reducing” Continuous Data to Contingency Tables 


Most applications of contingency tables begin with qualitative data, Case Study
10.5.1 being a typical case in point. Sometimes, though, contingency tables can provide
a particularly convenient format for testing the independence of two random
variables that initially appear as quantitative data. If those $x$ and $y$ measurements
are each reduced to being either “high” or “low,” for example, the original $x_i$’s and
$y_i$’s become frequencies in a 2 × 2 contingency table (and can be used to test $H_0$: $X$
and $Y$ are independent).
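The reduction from paired measurements to a 2 × 2 table can be sketched as follows. This is one possible convention, assumed here for illustration: each measurement is called “high” if it is at least as large as its sample mean and “low” otherwise. The function name `dichotomize` and the six data pairs are hypothetical.

```python
def dichotomize(xs, ys):
    """Cross-classify paired data as high/low relative to the sample means.

    Rows:    x >= xbar (index 0), x < xbar (index 1)
    Columns: y <  ybar (index 0), y >= ybar (index 1)
    """
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    table = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        row = 0 if x >= xbar else 1
        col = 0 if y < ybar else 1
        table[row][col] += 1
    return table

# Hypothetical paired measurements, not taken from the text.
xs = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
ys = [9.0, 7.0, 8.0, 3.0, 6.0, 1.0]

table = dichotomize(xs, ys)
print(table)
```

The resulting four counts can then be treated like any other 2 × 2 contingency table. Some information is lost in the reduction, since only the side of the mean on which each measurement falls is retained.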
Case Study 10.5.2


Sociologists have speculated that feelings of alienation may be a major factor 
contributing to an individual’s risk of committing suicide. If so, cities with more 
transient populations should have higher suicide rates than urban areas where 
neighborhoods are more stable. Listed in Table 10.5.7 is the “mobility index” (y) 
and the “suicide rate” (x) for each of twenty-five U.S. cities (210). (Note: The 
mobility index was defined in such a way that smaller values of y correspond to 
higher levels of transiency.) Do these data support the sociologists’ suspicion? 


Table 10.5.7

                 Suicides per     Mobility                        Suicides per     Mobility
City             100,000, $x_i$   Index, $y_i$    City            100,000, $x_i$   Index, $y_i$
New York         19.3             54.3            Washington      22.5             37.1
Chicago          17.0             51.5            Minneapolis     23.8             56.3
Philadelphia     17.5             64.6            New Orleans     17.2             82.9
Detroit          16.5             42.5            Cincinnati      23.9             62.2
Los Angeles      23.8             20.3            Newark          21.4             51.9
Cleveland        20.1             52.2            Kansas City     24.5             49.4
St. Louis        24.8             62.4            Seattle         31.7             30.7
Baltimore        18.0             72.0            Indianapolis    21.0             66.1
Boston           14.8             59.4            Rochester       17.2             68.0
Pittsburgh       14.9             70.0            Jersey City     10.1             56.5
San Francisco    40.0             43.8            Louisville      16.6             78.7
Milwaukee        19.3             66.2            Portland        29.3             33.2
Buffalo          13.8             67.6


To reduce these data to a 2 × 2 contingency table, we redefine each $x_i$ as
being either “$\ge \bar{x}$” or “$< \bar{x}$” and each $y_i$ as being either “$\ge \bar{y}$” or “$< \bar{y}$.” Here,

$$\bar{x} = \frac{19.3 + 17.0 + \cdots + 29.3}{25} = 20.8$$

and

$$\bar{y} = \frac{54.3 + 51.5 + \cdots + 33.2}{25} = 56.0$$

so the twenty-five $(x_i, y_i)$’s produce the 2 × 2 contingency table shown in
Table 10.5.8.


Table 10.5.8

                                     Mobility Index
                              Low ($< 56.0$)    High ($\ge 56.0$)
Suicide    High ($\ge 20.8$)         7                 4
Rate       Low ($< 20.8$)            3                11


If $X$ and $Y$ are independent, the four estimated expected frequencies associated
with the contingency table (and calculated from Equation 10.5.4) are the
entries appearing in Table 10.5.9.


Table 10.5.9

                                     Mobility Index
                              Low ($< 56.0$)    High ($\ge 56.0$)
Suicide    High ($\ge 20.8$)        4.4*              6.6
Rate       Low ($< 20.8$)           5.6               8.4

*$\hat{E}(X_{11}) = 4.4$ does not quite satisfy the “$n\hat{p}_i\hat{q}_j \ge 5$” restriction stated in
Theorem 10.5.1, but 4.4 is close enough to 5 to maintain the integrity of the $\chi^2$
approximation.


Substituting into the test statistic from Theorem 10.5.1 gives

$$d_2 = \frac{(7 - 4.4)^2}{4.4} + \frac{(4 - 6.6)^2}{6.6} + \frac{(3 - 5.6)^2}{5.6} + \frac{(11 - 8.4)^2}{8.4} = 4.57$$

With $(r - 1)(c - 1) = (2 - 1)(2 - 1) = 1$ df, the $\alpha = 0.05$ critical value associated
with $d_2$ is $\chi^2_{0.95,1} = 3.841$. The appropriate conclusion, then, is to reject $H_0$ since
$d_2 \ge 3.841$: the sociologists’ suspicion that suicide rates and transiency in urban
areas are dependent is borne out by the data.


Case Study 10.5.3 


Beginning in 1647, witchcraft accusations, trials, and executions were an 
on-again, off-again phenomenon in the New England colonies. Occasionally, 
Puritan angst over matters Satanic would flare up like an epidemic, and entire 
communities would become convinced that many of their neighbors were in 
league with the Devil. The most famous of these paranoid outbreaks were the 
Salem witch trials that occurred in 1692 and 1693. 

Altogether, a total of 185 adults (and children) were accused of witchcraft 
in Salem during those two years, 141 females and 44 males. Fourteen of the 
women (9.9%) were eventually hanged; a similar fate befell five (11.4%) of the 
men (90). Is the difference between 9.9% and 11.4% statistically significant? 

Recall the discussion on p. 447. Testing whether the difference between
two independent binomial proportions is statistically significant is equivalent
to testing whether the two factors represented in a 2 × 2 contingency table are
independent; that is, $H_0\colon p_X = p_Y$ will be rejected if and only if we reject $H_0$: $X$
and $Y$ are independent.

Table 10.5.10 shows the Salem data presented as a 2 × 2 contingency
table. In parentheses are the expected frequencies calculated under the
null hypothesis that the chances of an accused witch being executed were
independent of gender.


Table 10.5.10
                          Accused Witches
                   Female           Male           Totals
Executed            14 (14.5)        5 (4.5)         19
Not Executed       127 (126.5)      39 (39.5)       166
Totals             141              44              185


By part (b) of Theorem 10.5.1, $d_2 = 0.08$ (with 1 df). If the independence
hypothesis is to be tested at the $\alpha = 0.05$ level of significance, the appropriate
critical value is

$$\chi^2_{1-\alpha,(r-1)(c-1)} = \chi^2_{0.95,(2-1)(2-1)} = \chi^2_{0.95,1} = 3.84$$

so the conclusion is “fail to reject $H_0$.”

If these data were viewed as two independent sets of Bernoulli trials of sizes
141 and 44, respectively, the appropriate test statistic would be the observed $Z$
ratio described in Theorem 9.4.1. With $x = 14$, $n = 141$, $y = 5$, and $m = 44$, $z = -0.28$;
at $\alpha = 0.05$, the critical values would be $\pm 1.96$ (see Question 10.5.10), implying
that the difference between 9.9% and 11.4% is not statistically significant.
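The agreement between the two approaches can be checked numerically. The sketch below assumes that the $Z$ ratio of Theorem 9.4.1 is the usual pooled two-sample statistic, with pooled estimate $(x + y)/(n + m)$; the variable names are illustrative. Computed without intermediate rounding, the values come out near $-0.27$ and $0.075$, which round to the $-0.28$ and $0.08$ quoted in the text only approximately.

```python
import math

x, n = 14, 141   # women executed, women accused
y, m = 5, 44     # men executed, men accused

# Contingency-table statistic d2 (rows: executed / not executed).
table = [[x, y], [n - x, m - y]]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)
d2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / total) ** 2
    / (row_totals[i] * col_totals[j] / total)
    for i in range(2)
    for j in range(2)
)

# Pooled two-sample Z ratio (assumed form of the statistic in Theorem 9.4.1).
p_pooled = (x + y) / (n + m)
z = (x / n - y / m) / math.sqrt(p_pooled * (1 - p_pooled) * (1 / n + 1 / m))

print(round(d2, 4), round(z, 4), round(z ** 2, 4))
```

The squared $Z$ ratio equals $d_2$, and the squared critical value $(1.96)^2$ equals 3.84, so the two tests always reach the same decision for a 2 × 2 table.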


Questions 


10.5.1 Market researchers often gather information by
telephone, but calling only listed numbers may badly skew
the responses, if listed and unlisted households are fundamentally
different with respect to the question being
asked. The following is the slightly modified summary of
a survey done by Pacific Bell to see whether homeownership
is related to telephone listing (142). At the $\alpha = 0.05$
and $\alpha = 0.10$ levels of significance, test whether those two
“conditions” are independent.


         Listed    Unlisted
Own      628       146
Rent     172        54


10.5.2 Many factors influence a company’s decision to
relocate to another site. The state of Florida, hoping to
attract such relocations, sponsored a study (50) on how
different companies view various factors. One part of the
study compared the importance of a high-quality workforce
to manufacturing firms and to nonmanufacturing
firms. At the $\alpha = 0.05$ level of significance, do the following
data suggest that the importance of a high-quality workforce
is not viewed the same by all types of businesses?


                                        Manufacturing    Other
Importance    Extremely or somewhat     168              73
              Not very                   42              26

10.5.3 A total of 1154 girls attending a public high school
were given a questionnaire that measured how much each
had exhibited delinquent behavior (124). From an analysis
of the results, the researchers categorized 111 of the girls
as “delinquent.” The following is a cross-classification of
the delinquents and the nondelinquents according to their
birth order. At the $\alpha = 0.01$ level of significance, is there
evidence here to support the contention that birth order
and delinquency are related?


               Delinquent    Not Delinquent
Oldest         24            450
In Between     29            312
Youngest       35            211
Only Child     23             70


10.5.4 Recall the rubella/birth defect study described in
Case Study 8.1.3. At the $\alpha = 0.01$ level of significance,
can it be concluded that the risk of an abnormal birth is
affected by when a rubella infection is contracted during
pregnancy?


10.5.5 Research has suggested that regular use of aspirin 
or other nonsteroidal anti-inflammatory drugs (NSAIDs) 
may be effective in reducing the risk of breast cancer. 
In one recent study (179), 1442 women with breast can- 
cer were asked whether they had used aspirin regularly 
one year prior to their diagnosis; 301 said “yes.” Among 
a matched control group of 1420 women without breast 
cancer, 345 reported that they were regular aspirin users. 
What would you conclude? Set up and test an appropriate 
hypothesis. Let 0.05 be the level of significance. 


10.5.6 High blood pressure is known to be one of the
major contributors to coronary heart disease. A study was
done to see whether or not there is a significant relationship
between the blood pressures of children and those of
their fathers (88). If such a relationship did exist, it might
be possible to use one group to screen for high-risk individuals
in the other group. The subjects were 92 eleventh
graders, 47 males and 45 females, and their fathers. Blood
pressures for both the children and the fathers were categorized
as belonging to either the lower, middle, or upper
third of their respective distributions. Test whether or not
the blood pressures of children can be considered to be
independent of the blood pressures of their fathers. Let
$\alpha = 0.05$.


                                  Child’s Blood Pressure
                            Lower      Middle     Upper
                            Third      Third      Third
Father’s    Lower third     14         11          8
blood       Middle third    11         11          9
pressure    Upper third      6         10         12

10.5.7 The following data were collected as part of a
study to see whether a mouse’s early upbringing has any
effect on its aggressiveness later in life (84). A total of
307 mice were divided into two groups shortly after birth.
Each of the 167 mice in the first group was raised by its
natural mother; the remaining 140 in the second group
were raised by “foster” mice. When each mouse was three
months old, it was put into a small cage with another
mouse it had not seen before. The two were then watched
for a predetermined period of time (six minutes) to see
whether they would start fighting. Set up and carry out an
appropriate $\chi^2$ test. Let $\alpha = 0.05$.


                       Natural Mother    Foster Mother
Number fighting         27                47
Number not fighting    140                93
                       167               140


10.5.8 The Hopwood Decision resulted from a 1996 US. 
Fifth Circuit Court of Appeals case that greatly limited 
Texas universities’ affirmative-action programs for admis- 
sion of minority students. As a consequence, minority 
enrollment dropped significantly. One solution proposed 
was to accept all students in the top 10% of their graduat- 
ing class. The success of such a plan in achieving diversity 
would hinge on the enrollment rates for the different 
racial groups. The following are the average numbers of 
freshmen in the top 10% of their classes admitted and 
enrolled, by race, at UT-Austin for the years 1990-1996. 
Are the enrollment rates dependent on the racial groups? 
Do the appropriate analysis using the α = 0.05 level of 
significance. 


Admitted Enrolled 


White 2592 1481 
African-American 159 78 
Hispanic 800 375 
Asian 667 399 


10.5.9 Portfolio turnover expresses the past year’s trad- 
ing activity as a percentage of an account’s average assets. 
The following table summarizes the performances of one 
hundred mutual funds cross-classified according to port- 
folio turnover and annual return. Test the independence 
assumption. Let α = 0.05. 


                      Annual Return 
                     <10%      ≥10% 
Portfolio   ≥100%     11        10 
Turnover    <100%     55        24 


10.5.10 (a) For the witchcraft data described in Case 
Study 10.5.3, verify that z = −0.28. 

(b) Notice that (−0.28)² = 0.08 and (1.96)² = 3.84. Why 
should those equalities be true? 
10.6 Taking a Second Look at Statistics (Outliers) 


This chapter has explored important questions related to the “pedigree” of a set of 
data. Given the measurements $y_1, y_2, \ldots, y_n$, for example, is it believable that they 
represent a random sample from some particular pdf, f(y)? Or, given a set of bivariate 
observations, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, representing the random variables X 
and Y, is it believable that X and Y are independent? 

In practice, experimenters sometimes encounter a slightly different sort of pedi- 
gree question, one that focuses on individual measurements rather than on entire 
data sets. For example, suppose a laboratory experiment has yielded the twenty 
observations listed in Table 10.6.1. Grouped into classes of width 10, the data have 
the frequency distribution shown in Table 10.6.2. The question is, what (if anything) 
should be done with the measurement y = 127.6 that lies considerably to the right 
of the rest of the data? Is it simply the largest observation in the sample (in which 
case it should be kept), or does its separation from the bulk of the distribution reflect 
some sort of fundamental measurement error (in which case it should be discarded)? 


Table 10.6.1 


73.55    45.6     51.2    15.6    49.2 
55.7     24.8    127.6    49.7    53.8 
91.6     82.9     78.4    58.4    67.9 
44.3     62.4     37.4    30.8    59.6 


Table 10.6.2 


Observation              Frequency 

 10.0 ≤ y <  20.0            1 
 20.0 ≤ y <  30.0            1 
 30.0 ≤ y <  40.0            2 
 40.0 ≤ y <  50.0            4 
 50.0 ≤ y <  60.0            5 
 60.0 ≤ y <  70.0            2 
 70.0 ≤ y <  80.0            2 
 80.0 ≤ y <  90.0            1 
 90.0 ≤ y < 100.0            1 
100.0 ≤ y < 110.0            0 
110.0 ≤ y < 120.0            0 
120.0 ≤ y < 130.0            1 


While there is no way to answer that question with any certainty, there are test 
procedures that can shed some light on the likelihood (subject to certain assump- 
tions) of an “outlier” being a sample from the same pdf that generated all the other 
observations. One such procedure, due to Dixon (38), assumes that the observations 
are coming from a normal distribution and is based on either the ratio 

$$r_{10} = \frac{y'_2 - y'_1}{y'_n - y'_1} \quad \text{or} \quad r_{01} = \frac{y'_n - y'_{n-1}}{y'_n - y'_1}$$

where $y'_i$ is the ith order statistic in the sample of size n. If the potential outlier is 
the largest observation in the sample, the test statistic is $r_{01}$; if the potential outlier is 
the smallest observation, the test statistic is $r_{10}$. 
Table 10.6.3 gives upper percentage points for the distribution of $r_{01}$ (and $r_{10}$) as 
a function of the sample size n. For the data in Table 10.6.1, n = 20 and the largest 
observation is the measurement in question, so the test statistic is 

$$r_{01} = \frac{y'_{20} - y'_{19}}{y'_{20} - y'_1}$$

From Table 10.6.1, $y'_1 = 15.6$, $y'_{19} = 91.6$, and $y'_{20} = 127.6$, so 

$$r_{01} = \frac{127.6 - 91.6}{127.6 - 15.6} = 0.32$$

According to Table 10.6.3, the P-value associated with the outcome $r_{01} = 0.32$ is 
between 0.05 and 0.02, since the 95th percentile of the $r_{01}$ distribution when n = 20 


Table 10.6.3 
PERCENTAGE POINTS OF THE DISTRIBUTION OF $r_{10}$ 

                        Percentile 
 n      80      90      95      98      99     99.5 
 3    0.781   0.886   0.941   0.976   0.988   0.994 
 4    0.560   0.679   0.765   0.846   0.889   0.926 
 5    0.451   0.557   0.642   0.729   0.780   0.821 
 6    0.386   0.482   0.560   0.644   0.698   0.740 
 7    0.344   0.434   0.507   0.586   0.637   0.680 
 8    0.314   0.399   0.468   0.543   0.590   0.634 
 9    0.290   0.370   0.437   0.510   0.555   0.598 
10    0.273   0.349   0.412   0.483   0.527   0.568 
11    0.259   0.332   0.392   0.460   0.502   0.542 
12    0.247   0.318   0.376   0.441   0.482   0.522 
13    0.237   0.305   0.361   0.425   0.465   0.503 
14    0.228   0.294   0.349   0.411   0.450   0.488 
15    0.220   0.285   0.338   0.399   0.438   0.475 
16    0.213   0.277   0.329   0.388   0.426   0.463 
17    0.207   0.269   0.320   0.379   0.416   0.452 
18    0.202   0.263   0.313   0.370   0.407   0.442 
19    0.197   0.258   0.306   0.363   0.398   0.433 
20    0.193   0.252   0.300   0.356   0.391   0.425 
21    0.189   0.247   0.295   0.350   0.384   0.418 
22    0.185   0.242   0.290   0.344   0.378   0.411 
23    0.182   0.238   0.285   0.338   0.370   0.404 
24    0.179   0.234   0.281   0.333   0.367   0.399 
25    0.176   0.230   0.277   0.329   0.362   0.393 
26    0.173   0.227   0.273   0.324   0.357   0.388 
27    0.171   0.224   0.269   0.320   0.353   0.384 
28    0.168   0.220   0.266   0.316   0.349   0.380 
29    0.166   0.218   0.263   0.312   0.345   0.376 
30    0.164   0.215   0.260   0.309   0.341   0.372 


Source: Dunn, Olive Jean and Clark, Virginia A. Applied Statistics: Analysis of Variance and Regression. New York: 
John Wiley & Sons, 1974, p. 374. 
is 0.300, and the 98th percentile is 0.356. So, should $y'_{20}$ be discarded? Probably not, 
unless there is reason to believe that its large value was the result of a measurement 
error. The distribution of $r_{01}$ makes it clear that a value of 127.6 in this case is not 
dramatically out of line. 
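
The arithmetic behind Dixon's ratio is easy to script. The following Python sketch is only an illustration, not part of the original procedure: the function name `dixon_r01` is a hypothetical helper, the data are typed in from Table 10.6.1, and the two cutoffs are the n = 20 entries quoted from Table 10.6.3.

```python
# Observations from Table 10.6.1, entered as listed there.
data = [73.55, 45.6, 51.2, 15.6, 49.2,
        55.7, 24.8, 127.6, 49.7, 53.8,
        91.6, 82.9, 78.4, 58.4, 67.9,
        44.3, 62.4, 37.4, 30.8, 59.6]


def dixon_r01(sample):
    """Dixon's ratio for a suspiciously large maximum (hypothetical helper).

    Computes (largest - second largest) / (largest - smallest)
    from the order statistics of the sample.
    """
    ordered = sorted(sample)
    return (ordered[-1] - ordered[-2]) / (ordered[-1] - ordered[0])


r01 = dixon_r01(data)

# 95th and 98th percentiles for n = 20, as quoted from Table 10.6.3.
p95, p98 = 0.300, 0.356

print(f"r01 = {r01:.2f}")
print("falls between the 95th and 98th percentiles:", p95 < r01 < p98)
```

Because the statistic depends only on the two largest and the single smallest observations, the remaining seventeen values play no role in the calculation.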

A word of caution—the simplicity of testing for outliers does not mean the pro- 
cedure should be used capriciously. More than a few experimenters have woefully 
regretted discarding suspicious observations under the guise of “cleaning up” their 
data. Sometimes the observations not fitting the presumed model constitute the 
most important information in a data set, because they may be the first and only 
clues that the presumed model is, in fact, incorrect. 


Appendix 10.A.1 Minitab Applications 


Figure 10.A.1.1 


The Minitab command CHISQUARE, followed by the columns in which the 
observed frequencies have been entered, performs the χ² test for independence 
described in Theorem 10.5.1. Figure 10.A.1.1 shows the input and output for the 
Minitab analysis of the data in Case Study 10.5.1. In addition to the estimated 
expected frequencies and the value of the test statistic, the CHISQUARE routine 
also indicates the number of degrees of freedom associated with $d_2$ and its 
P-value. Here the P-value is so small there is no question that the null hypothesis 
of independence should be rejected. 


MTB > set c1 
DATA > 24 8 10 
DATA > end 
MTB > set c2 
DATA > 8 13 9 
DATA > end 
MTB > set c3 
DATA > 13 11 64 
DATA > end 
MTB > chisquare c1-c3 


Chi Square Test: C1, C2, C3 


Expected counts are printed below observed counts 
Chi square contributions are printed below expected counts 


       C1       C2       C3    Total 

1 24 8 13 45 
11.81 8.44 24.75 
12.574 0.023 5.578 

2 8 13 11 32 
8.40 6.00 17.60 
0.019 8.167 2.475 

3 10 9 64 83 
21.79 15.56 45.65 
6.377 2.767 7.376 

Total 42 30 88 160 


Chi-Sq =45.357, DF =4, P-Value =0.000 
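
For readers without access to Minitab, the same quantities can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the helper name `chi2_upper_tail_even_df` is hypothetical, and its closed-form tail probability is valid only when the number of degrees of freedom is even (as it is here, with df = 4).

```python
import math

# Observed frequencies, one inner list per row of the Minitab output above.
observed = [[24, 8, 13],
            [8, 13, 11],
            [10, 9, 64]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (observed - expected)^2 / expected over all cells, where each
# expected count is (row total)(column total)/n.
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_upper_tail_even_df(x, df):
    """P(chi-square variable > x) for EVEN df only (hypothetical helper).

    For df = 2m the tail is exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))


p_value = chi2_upper_tail_even_df(chi_sq, df)

print(f"Chi-Sq = {chi_sq:.3f}, DF = {df}, P-Value = {p_value:.3g}")
```

The P-value that Minitab rounds to 0.000 is, of course, not exactly zero; the script shows how small it is.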


Testing for Independence Using Minitab Windows 


1. Enter each column of observed frequencies in a separate column. 
2. Click on STAT, then on TABLES, then on CHISQUARE TEST. 
3. Enter the columns containing the data, and click on OK. 
Galton had earned a Cambridge mathematics degree and completed two years of 
medical school when his father died, leaving him with a substantial inheritance. Free 
to travel, he became an explorer of some note, but when The Origin of Species was 
published in 1859, his interests began to shift from geography to statistics and 
anthropology (Charles Darwin was his cousin). It was Galton’s work on fingerprints 
that made possible their use in human identification. He was knighted in 1909. 
—Francis Galton (1822-1911) 


11.1 Introduction 


High on the list of problems that experimenters most frequently need to deal with 
is the determination of the relationships that exist among the various components 
of a complex system. If those relationships are sufficiently understood, there is a 
good possibility that the system’s output can be effectively modeled, maybe even 
controlled. 

Consider, for example, the formidable problem of relating the incidence of can- 
cer to its many contributing causes—diet, genetic makeup, pollution, and cigarette 
smoking, to name only a few. Or think of the Wall Street financier trying to antici- 
pate trends in stock prices by tracking market indices and corporate performances, 
as well as the overall economic climate. In those situations, a host of variables are 
involved, and the analysis becomes very intricate. Fortunately, many of the fun- 
damental ideas associated with the study of relationships can be nicely illustrated 
when only two variables are involved. This two-variable model will be the focus of 
Chapter 11. 

Section 11.2 gives a computational technique for determining the “best” equation 
describing a set of points $(x_1, y_1), (x_2, y_2), \ldots,$ and $(x_n, y_n)$, where best is defined 
geometrically. Section 11.3 adds a probability distribution to the y-variable, which 
allows for a variety of inference procedures to be developed. The consequences 
of both measurements being random variables are the topic of Section 11.4. Then 
Section 11.5 takes up a special case of Section 11.4, where the variability in X and Y 
is described by the bivariate normal pdf. 
11.2 The Method of Least Squares 


We begin our study of the relationship between two variables by asking a simple 
geometry question. Given a set of n points, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, and a 
positive integer m, which polynomial of degree m is “closest” to the given points? 
Suppose that the desired polynomial, p(x), is written 

$$p(x) = a + \sum_{i=1}^{m} b_i x^i$$

where $a, b_1, \ldots, b_m$ are to be determined. The method of least squares answers the 
question by finding the coefficient values that minimize the sum of the squares of 
the vertical distances from the data points to the presumed polynomial. That is, the 
polynomial p(x) that we will call “best” is the one whose coefficients minimize the 
function L, where 

$$L = \sum_{i=1}^{n} [y_i - p(x_i)]^2$$

Theorem 11.2.1 summarizes the method of least squares as it applies to the important 
special case where p(x) is a linear polynomial. (Note: To simplify notation, the 
linear polynomial $y = a + b_1 x^1$ will be written $y = a + bx$.) 


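Theorem 11.2.1. Given n points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the straight line $y = a + bx$ minimizing 

$$L = \sum_{i=1}^{n} [y_i - (a + bx_i)]^2$$

has slope 

$$b = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

and y-intercept 

$$a = \frac{\sum_{i=1}^{n} y_i - b\sum_{i=1}^{n} x_i}{n} = \bar{y} - b\bar{x}$$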


Proof The proof is accomplished by the familiar calculus technique of taking the 
partial derivatives of L with respect to a and b, setting the resulting expressions 
equal to 0, and solving. By the first step we get 

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{n} (-2)x_i[y_i - (a + bx_i)]$$

and 

$$\frac{\partial L}{\partial a} = \sum_{i=1}^{n} (-2)[y_i - (a + bx_i)]$$

Setting the right-hand sides of ∂L/∂a and ∂L/∂b equal to 0 and simplifying 
yields the two equations 

$$na + \left(\sum_{i=1}^{n} x_i\right) b = \sum_{i=1}^{n} y_i$$

and 

$$\left(\sum_{i=1}^{n} x_i\right) a + \left(\sum_{i=1}^{n} x_i^2\right) b = \sum_{i=1}^{n} x_i y_i$$

An application of Cramer’s rule gives the solution for b stated in the theorem. The 
expression for a follows immediately. 
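
For readers who would like to see the Cramer's rule step written out, the following display is one way to carry it through. It starts from the two equations obtained by setting the partial derivatives to zero (all sums run from i = 1 to n) and introduces no notation beyond what appears in the theorem.

```latex
\begin{align*}
b &= \frac{\begin{vmatrix} n & \sum y_i \\[0.5ex] \sum x_i & \sum x_i y_i \end{vmatrix}}
          {\begin{vmatrix} n & \sum x_i \\[0.5ex] \sum x_i & \sum x_i^2 \end{vmatrix}}
   = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}
          {n\sum x_i^2 - \left(\sum x_i\right)^2} \\[1.5ex]
a &= \frac{\sum y_i - b\sum x_i}{n} = \bar{y} - b\bar{x}
\end{align*}
```

The second line comes from solving the first of the two equations, $na + (\sum x_i)b = \sum y_i$, for a once b is known.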


Case Study 11.2.1 


A manufacturer of air conditioning units is having assembly problems due to 
the failure of a connecting rod to meet finished-weight specifications. Too many 
rods are being completely tooled, then rejected as overweight. To reduce that 
cost, the company’s quality-control department wants to quantify the relation- 
ship between the weight of the finished rod, y, and that of the rough casting, x 
(139). Castings likely to produce rods that are too heavy can then be discarded 
before undergoing the final (and costly) tooling process. 

As a first step in examining the xy-relationship, twenty-five (x;, y;) pairs 
are measured (see Table 11.2.1). Graphed, the points suggest that the weight 
of the finished rod is linearly related to the weight of the rough casting (see 
Figure 11.2.1). Use Theorem 11.2.1 to find the best straight line approximating 
the xy-relationship. 


Table 11.2.1 

 Rod       Rough      Finished       Rod       Rough      Finished 
Number   Weight, x    Weight, y     Number   Weight, x    Weight, y 
   1       2.745        2.080         14       2.635        1.990 
   2       2.700        2.045         15       2.630        1.990 
   3       2.690        2.050         16       2.625        1.995 
   4       2.680        2.005         17       2.625        1.985 
   5       2.675        2.035         18       2.620        1.970 
   6       2.670        2.035         19       2.615        1.985 
   7       2.665        2.020         20       2.615        1.990 
   8       2.660        2.005         21       2.615        1.995 
   9       2.655        2.010         22       2.610        1.990 
  10       2.655        2.000         23       2.590        1.975 
  11       2.650        2.000         24       2.590        1.995 
  12       2.650        2.005         25       2.565        1.955 
  13       2.645        2.015 


From Table 11.2.1, we find that 

$$\sum_{i=1}^{25} x_i = 66.075 \qquad \sum_{i=1}^{25} x_i^2 = 174.672925$$

$$\sum_{i=1}^{25} y_i = 50.12 \qquad \sum_{i=1}^{25} y_i^2 = 100.49865$$

$$\sum_{i=1}^{25} x_i y_i = 132.490725$$

[Figure 11.2.1: scatterplot of finished weight, y, versus rough weight, x, with the least squares line y = 0.308 + 0.642x] 


Therefore, 

$$b = \frac{25(132.490725) - (66.075)(50.12)}{25(174.672925) - (66.075)^2} = 0.642$$

and 

$$a = \frac{50.12 - 0.642(66.075)}{25} = 0.308$$


making the least squares line 
y = 0.308 + 0.642x 


The manufacturer is now in a position to make some informed policy deci- 
sions. If the weight of a rough casting is, say, 2.71 oz., the least squares line 
predicts that its finished weight will be 2.05 oz.: 


estimated weight = a + b(2.71) = 0.308 + 0.642(2.71) = 2.05 


In the event that finished weights of 2.05 oz. are considered to be too heavy, 
rough castings weighing 2.71 oz. (or more) should be discarded. 
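
The calculation in this case study can be checked with a short Python sketch. It assumes only the five sums quoted from Table 11.2.1 and the formulas of Theorem 11.2.1; the variable names are ours.

```python
# Sums quoted in the case study for the n = 25 connecting rods.
n = 25
sum_x = 66.075        # sum of rough weights
sum_y = 50.12         # sum of finished weights
sum_x2 = 174.672925   # sum of squared rough weights
sum_xy = 132.490725   # sum of cross products

# Slope and y-intercept from Theorem 11.2.1.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Predicted finished weight for a rough casting of 2.71 oz.
predicted = a + b * 2.71

print(f"b = {b:.3f}, a = {a:.3f}")
print(f"predicted finished weight at x = 2.71: {predicted:.2f} oz.")
```

Note that the numerator and denominator of b are each small differences of two large numbers, so the sums should be carried to full precision rather than rounded before the subtraction.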


Residuals 


The difference between an observed $y_i$ and the value of the least squares line when 
$x = x_i$ is called the ith residual. Its magnitude reflects the failure of the least squares 
line to “model” that particular point. 


Definition 11.2.1. Let a and b be the least squares coefficients associated with 
the sample $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. For any value of x, the quantity 
$\hat{y} = a + bx$ is known as the predicted value of y. For any given i, i = 1, 2, ..., n, the 
difference $y_i - \hat{y}_i = y_i - (a + bx_i)$ is called the ith residual. A graph of $y_i - \hat{y}_i$ 
versus $x_i$, for all i, is called a residual plot. 
Interpreting Residual Plots 


Applied statisticians find residual plots to be very helpful in assessing the 
appropriateness of fitting a straight line through a given set of n points. If the 
relationship between x and y is linear, the corresponding residual plot typically 
shows no patterns, cycles, trends, or outliers. For nonlinear relationships, 
though, residual plots often take on dramatically nonrandom appearances that 
can very effectively highlight and illuminate the underlying association between 
x and y. 


Example 11.2.1 

Make the residual plot for the data in Case Study 11.2.1. What does its appearance 
imply about the suitability of fitting those points with a straight line? 

We begin by calculating the residuals for each of the twenty-five data points. 
The first observation recorded, for example, was $(x_1, y_1) = (2.745, 2.080)$. The 
corresponding predicted value, $\hat{y}_1$, is 2.070: 

$$\hat{y}_1 = 0.308 + 0.642(2.745) = 2.070$$

The first residual, then, is $y_1 - \hat{y}_1 = 2.080 - 2.070$, or 0.010. The complete set of 
residuals appears in the last column of Table 11.2.2. 


Table 11.2.2 

$x_i$     $y_i$     $\hat{y}_i$     $y_i - \hat{y}_i$ 
2.745     2.080     2.070      0.010 
2.700     2.045     2.041      0.004 
2.690     2.050     2.035      0.015 
2.680     2.005     2.029     −0.024 
2.675     2.035     2.025      0.010 
2.670     2.035     2.022      0.013 
2.665     2.020     2.019      0.001 
2.660     2.005     2.016     −0.011 
2.655     2.010     2.013     −0.003 
2.655     2.000     2.013     −0.013 
2.650     2.000     2.009     −0.009 
2.650     2.005     2.009     −0.004 
2.645     2.015     2.006      0.009 
2.635     1.990     2.000     −0.010 
2.630     1.990     1.996     −0.006 
2.625     1.995     1.993      0.002 
2.625     1.985     1.993     −0.008 
2.620     1.970     1.990     −0.020 
2.615     1.985     1.987     −0.002 
2.615     1.990     1.987      0.003 
2.615     1.995     1.987      0.008 
2.610     1.990     1.984      0.006 
2.590     1.975     1.971      0.004 
2.590     1.995     1.971      0.024 
2.565     1.955     1.955      0.000 
Figure 11.2.2 shows the residual plot generated by fitting the least squares 
straight line, y = 0.308 + 0.642x, to the twenty-five $(x_i, y_i)$’s. To an applied statistician, 
there is nothing here that would raise any serious doubts about using a straight 
line to describe the xy-relationship: the points appear to be randomly scattered and 
exhibit no obvious anomalies or patterns. 
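
The residuals for this example can also be generated by machine. The sketch below is an illustration only: it types in the twenty-five pairs from Table 11.2.1 and uses the rounded coefficients a = 0.308 and b = 0.642, so individual values may differ in the last digit from a table that rounds the predicted values first.

```python
# (rough weight x, finished weight y) pairs from Table 11.2.1.
pairs = [
    (2.745, 2.080), (2.700, 2.045), (2.690, 2.050), (2.680, 2.005),
    (2.675, 2.035), (2.670, 2.035), (2.665, 2.020), (2.660, 2.005),
    (2.655, 2.010), (2.655, 2.000), (2.650, 2.000), (2.650, 2.005),
    (2.645, 2.015), (2.635, 1.990), (2.630, 1.990), (2.625, 1.995),
    (2.625, 1.985), (2.620, 1.970), (2.615, 1.985), (2.615, 1.990),
    (2.615, 1.995), (2.610, 1.990), (2.590, 1.975), (2.590, 1.995),
    (2.565, 1.955),
]

# Rounded least squares coefficients from Case Study 11.2.1.
a, b = 0.308, 0.642

predicted = [a + b * x for x, _ in pairs]
residuals = [y - y_hat for (_, y), y_hat in zip(pairs, predicted)]

# Print the four columns: x, y, predicted y, residual.
for (x, y), y_hat, r in zip(pairs, predicted, residuals):
    print(f"{x:.3f}  {y:.3f}  {y_hat:.3f}  {r:+.3f}")

n_positive = sum(r > 0 for r in residuals)
print(f"{n_positive} positive residuals out of {len(residuals)}")
```

A rough balance between positive and negative residuals, with no run of one sign concentrated at either end of the x range, is the numerical counterpart of the "randomly scattered" appearance described above.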


Case Study 11.2.2 


Table 11.2.3 lists Social Security expenditures for five-year intervals from 1965 
to 2005. During that period, payouts rose from $19.2 billion to $529.9 billion. 
Substituting these nine $(x_i, y_i)$’s into the formulas in Theorem 11.2.1 gives 


y = −38.0 + 12.9x 


Table 11.2.3 

                                 Social Security Expenditures 
Year     Years after 1965, x          ($ billions), y 
1965              0                        19.2 
1970              5                        33.1 
1975             10                        69.2 
1980             15                       123.6 
1985             20                       190.6 
1990             25                       253.1 
1995             30                       339.8 
2000             35                       415.1 
2005             40                       529.9 
Source: www.socialsecurity.gov/history/trustfunds.html. 



as the least squares straight line describing the xy-relationship. Based on the 
data from 1965 to 2005, is it reasonable to predict that Social Security costs in 
the year 2010 (when x = 45) will be $543 billion [= −38.0 + 12.9(45)]? 

Not at all. At first glance, the least squares line does appear to fit the data 
quite well (see Figure 11.2.3). A closer look, though, suggests that the under- 
lying xy-relationship may be curvilinear rather than linear. The residual plot 
(Figure 11.2.4) confirms that suspicion—there we see a distinctly nonrandom 
pattern. 
Figure 11.2.4 

Clearly, extrapolating these data would be foolish. The figure for the next 
year, 2006, of $555 billion already exceeded the linear projection of $543 billion, 
leading economists to predict rapidly accelerating expenditures in the future. 
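
The nonrandom pattern is easy to see numerically as well. The following Python sketch refits the line to the nine points in Table 11.2.3 and lists the residuals; `least_squares` is a hypothetical helper that simply applies the formulas of Theorem 11.2.1.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x using Theorem 11.2.1."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Table 11.2.3: years after 1965 and expenditures in $ billions.
years_after_1965 = [0, 5, 10, 15, 20, 25, 30, 35, 40]
expenditures = [19.2, 33.1, 69.2, 123.6, 190.6, 253.1, 339.8, 415.1, 529.9]

a, b = least_squares(years_after_1965, expenditures)
residuals = [y - (a + b * x) for x, y in zip(years_after_1965, expenditures)]

print(f"y = {a:.1f} + {b:.1f}x")
for x, r in zip(years_after_1965, residuals):
    print(f"x = {x:2d}   residual = {r:+6.1f}")
```

The residuals are positive at both ends of the time period and negative throughout the middle, which is the signature of a curve being approximated by a straight line.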


Comment For the data in Table 11.2.3, the suggestion that the xy-relationship may 
be curvilinear is certainly present in Figure 11.2.3, but the residual plot makes the 
case much more emphatically. In point of fact, that will often be the case, which is 
why residual plots are such a valuable diagnostic tool: departures from randomness 
that may be only hinted at in an xy-plot will be exaggerated and highlighted in the 
corresponding residual plot. 


Case Study 11.2.3 


A new, presumably simpler laboratory procedure has been proposed for recov- 
ering calcium oxide (CaO) from solutions that contain magnesium. Critics 
of the method argue that the results are too dependent on the person who 
performs the analysis. To demonstrate their concern, they arrange for the pro- 
cedure to be run on ten samples, each containing a known amount of CaO. Nine 
of the ten tests are done by Chemist A; the other is run by Chemist B. Based on 
the results summarized in Table 11.2.4, does their criticism seem justified? 


Table 11.2.4 

Chemist CaO Present (in mg),x CaO Recovered (in mg), y 
A 4.0 3.7 
A 8.0 7.8 
A 12.5 12.1 
A 16.0 15.6 
A 20.0 19.8 
A 25.0 24.5 
B 31.0 31.1 
A 36.0 35.5 
A 40.0 39.4 
A 40.0 39.5 


Figure 11.2.5 shows the scatterplot of y versus x. The linear function 
appears to fit all ten points exceptionally well, which would suggest that the critics’ 
concerns are unwarranted. But look at the residual plot (Figure 11.2.6). 

[Figure 11.2.6: residual plot with points labeled Chemist A and Chemist B] 

The latter shows one point located noticeably further away from zero than any of 
the others, and that point corresponds to the one measurement attributed to 
Chemist B. So, while the scatterplot has failed to identify anything unusual 
about the data, the residual plot has focused on precisely the question the data 


Does the appearance of the residual plot—specifically, the separation 
between the Chemist B data point and the nine Chemist A data points— 
“prove” that the output from the new procedure is dependent on the analyst? 
No, but it does speak to the magnitude of the disparity and, in so doing, provides 
the critics with at least a partial answer to their original question. 
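
As a numerical companion to the residual plot, the sketch below fits the least squares line to the ten points in Table 11.2.4 and reports which observation has the largest residual in absolute value. The helper `least_squares` is hypothetical and just restates Theorem 11.2.1.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x using Theorem 11.2.1."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Table 11.2.4: chemist, CaO present (mg), CaO recovered (mg).
chemists = ["A", "A", "A", "A", "A", "A", "B", "A", "A", "A"]
present = [4.0, 8.0, 12.5, 16.0, 20.0, 25.0, 31.0, 36.0, 40.0, 40.0]
recovered = [3.7, 7.8, 12.1, 15.6, 19.8, 24.5, 31.1, 35.5, 39.4, 39.5]

a, b = least_squares(present, recovered)
residuals = [y - (a + b * x) for x, y in zip(present, recovered)]

# Index of the observation farthest from the fitted line.
largest = max(range(len(residuals)), key=lambda i: abs(residuals[i]))

for who, x, r in zip(chemists, present, residuals):
    print(f"Chemist {who}   x = {x:4.1f}   residual = {r:+.3f}")
print(f"largest |residual| belongs to Chemist {chemists[largest]}")
```

A single large residual is suggestive rather than conclusive, for the reasons given above: with only one measurement from Chemist B, there is no way to separate the analyst from ordinary run-to-run variation.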


Questions 


11.2.1. Crickets make their chirping sound by sliding one 
wing cover very rapidly back and forth over the other. 
Biologists have long been aware that there is a linear 
relationship between temperature and the frequency with 
which a cricket chirps, although the slope and y-intercept 
of the relationship vary from species to species. The fol- 
lowing table lists fifteen frequency-temperature observa- 
tions recorded for the striped ground cricket, Nemobius 
fasciatus fasciatus (135). Plot these data and find the equation 
of the least squares line, y = a + bx. Suppose a cricket 
of this species is observed to chirp eighteen times per 
second. What would be the estimated temperature? 
For the data in the table, the sums needed are: 

$$\sum_{i=1}^{15} x_i = 249.8 \qquad \sum_{i=1}^{15} x_i^2 = 4{,}200.56$$

$$\sum_{i=1}^{15} y_i = 1{,}200.6 \qquad \sum_{i=1}^{15} x_i y_i = 20{,}127.47$$


Observation    Chirps per Second,    Temperature, 
  Number               x                y (°F) 
1 20.0 88.6 
2 16.0 71.6 
3 19.8 93.3 
4 18.4 84.3 
5 17.1 80.6 
6 15.5 75.2 
7 14.7 69.7 
8 17.1 82.0 
9 15.4 69.4 

10 16.2 83.3 
11 15.0 79.6 
12 17.2 82.6 
13 16.0 80.6 
14 17.0 83.5 
15 14.4 76.3 


11.2.2. The aging of whisky in charred oak barrels brings 
about a number of chemical changes that enhance its taste 
and darken its color. The following table shows the change 
in a whisky’s proof as a function of the number of years it 
is stored (159). 


Age, x (years) Proof, y 


0 104.6 
0.5 


AANMBWNFR 
— 
=) 
— 
~ 


(Note: The proof initially decreases because of dilution by 
moisture in the staves of the barrels.) Graph these data 
and draw in the least squares line. 


11.2.3. As water temperature increases, sodium nitrate 
(NaNO₃) becomes more soluble. The following table (103) 
gives the number of parts of sodium nitrate that dissolve 
in one hundred parts of water. 


Temperature 


(degrees Celsius), x Parts Dissolved, y 


0 66.7 

4 71.0 
10 76.3 
15 80.6 
21 85.7 
29 92.9 
36 99.4 
51 113.6 
68 125.1 


Calculate the residuals, $y_1 - \hat{y}_1, \ldots, y_9 - \hat{y}_9$, and draw the 
residual plot. Does it suggest that fitting a straight line 
through these data would be appropriate? Use the following 
sums: 

$$\sum_{i=1}^{9} x_i = 234 \qquad \sum_{i=1}^{9} y_i = 811.3$$

$$\sum_{i=1}^{9} x_i^2 = 10{,}144 \qquad \sum_{i=1}^{9} x_i y_i = 24{,}628.6$$
11.2.4. What, if anything, is unusual about the following 
residual plots? 


[Figure: two residual plots, each with its vertical axis labeled Residual] 


11.2.5. The following is the residual plot that results from 
fitting the equation y = 6.0+ 2.0x to a set of n = 10 points. 
What, if anything, would be wrong with predicting that y 
will equal 30.0 when x = 12? 


[Figure: residual plot, vertical axis labeled Residual] 


11.2.6. Would the following residual plot produced by fit- 
ting a least squares straight line to a set of n = 13 points 
cause you to doubt that the underlying xy-relationship is 
linear? Explain. 


[Figure: residual plot, vertical axis labeled Residual] 


11.2.7. The relationship between school funding and stu- 
dent performance continues to be a hotly debated political 
and philosophical issue. Typical of the data available are 
the following figures, showing the per-pupil expenditures 
and graduation rate for 26 randomly chosen districts in 
Massachusetts. 

Graph the data and superimpose the least squares 
line, y = a + bx. What would you conclude about the 
xy-relationship? Use the following sums: 

$$\sum_{i=1}^{26} x_i = 360 \qquad \sum_{i=1}^{26} y_i = 2{,}256.6$$

$$\sum_{i=1}^{26} x_i^2 = 5{,}365.08 \qquad \sum_{i=1}^{26} x_i y_i = 31{,}402$$
                      Spending per Pupil 
District               (in 1000s), x        Graduation Rate 
Dighton-Rehoboth $10.0 88.7 
Duxbury $10.2 93.2 
Tyngsborough $10.2 95.1 
Lynnfield $10.3 94.0 
Southwick-Tolland $10.3 88.3 
Clinton $10.8 89.9 
Athol-Royalston $11.0 67.7 
Tantasqua $11.0 90.2 
Ayer $11.2 95.5 
Adams-Cheshire $11.6 75.2 
Danvers $12.1 84.6 
Lee $12.3 85.0 
Needham $12.6 94.8 
New Bedford $12.7 56.1 
Springfield $12.9 54.4 
Manchester Essex $13.0 97.9 
Dedham $13.9 83.0 
Lexington $14.5 94.0 
Chatham $14.7 91.4 
Newton $15.5 94.2 
Blackstone Valley $16.4 97.2 
Concord Carlisle $17.5 94.4 
Pathfinder $18.1 78.6 
Nantucket $20.8 87.6 
Essex $22.4 93.3 
Provincetown $24.0 92.3 


Source: profiles.doe.mass.edu/state-report/ppx.aspx. 


11.2.8. (a) Find the equation of the least squares straight 
line for the plant cover diversity/bird species diversity data 
given in Question 8.2.11. 

(b) Make the residual plot associated with the least 
squares fit asked for in part (a). Based on the appear- 
ance of the residual plot, would you conclude that fitting a 
straight line to these data is appropriate? Explain. 


11.2.9. A nuclear plant was established in Hanford, 
Washington, in 1943. Over the years, a significant amount 
of strontium 90 and cesium 137 leaked into the Columbia 
River. In a study to determine how much this radioactiv- 
ity caused serious medical problems for those who lived 
along the river, public health officials created an index 
of radioactive exposure for nine Oregon counties in the 
vicinity of the river. As a covariate, cancer mortality was 
determined for each of the counties (40). The results are 
given in the table in the next column. For the nine $(x_i, y_i)$’s 
in the table, 

$$\sum_{i=1}^{9} x_i = 41.56 \qquad \sum_{i=1}^{9} y_i = 1{,}416.1$$

$$\sum_{i=1}^{9} x_i^2 = 289.4222 \qquad \sum_{i=1}^{9} x_i y_i = 7{,}439.37$$


                                        Cancer Mortality 
County        Index of Exposure, x      per 100,000, y 
Umatilla 2.49 147.1 
Morrow 2.57 130.1 
Gilliam 3.41 129.9 
Sherman 1.25 113.5 
Wasco 1.62 137.5 
Hood River 3.83 162.3 
Portland 11.64 207.5 
Columbia 6.41 177.9 
Clatsop 8.34 210.3 


Find the least squares straight line for these points. Also, 
construct the corresponding residual plot. Does it seem 
reasonable to conclude that x and y are linearly related? 


11.2.10. Would you have any reservations about fitting 
the following data with a straight line? Explain. 


[Figure: plot of the data] 
11.2.11. When two closely related species are crossed, the 
progeny will tend to have physical traits that lie some- 
where between those of the two parents. Whether a similar 
mixing occurs with behavioral traits was the focus of an 
experiment where the subjects were mallard and pintail 
ducks (162). A total of eleven males were studied; all were 
second-generation crosses. A rating scale was devised that 
measured the extent to which the plumage of each of 
the ducks resembled the plumage of the first genera- 
tion’s parents. A score of 0 indicated that the hybrid had 
the same appearance (phenotype) as a pure mallard; a 
score of 20 meant that the hybrid looked like a pintail. 
Similarly, certain behavioral traits were quantified and a 
second scale was constructed that ranged from 0 (com- 
pletely mallard-like) to 15 (completely pintail-like). Use 
Theorem 11.2.1 and the following data to summarize the 
relationship between the plumage and behavioral indices. 
Does a linear model seem adequate? 


Male Plumage Index,x Behavioral Index, y 
R 7 3 
S 13 10 
D 14 11 
F 6 5 
W 14 15 
K 15 15 
U 4 7 
O 8 10 
V 7 4 
J 9 9 
L 14 11 


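11.2.12. Verify that the coefficients a and b of the 
least squares straight line are solutions of the matrix 
equation 

$$\begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix}$$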

11.2.13. Prove that a least squares straight line must 
necessarily pass through the point $(\bar{x}, \bar{y})$. 


11.2.14. In some regression situations, there are a pri- 
ori reasons for assuming that the xy-relationship being 
approximated passes through the origin. If so, the equation 
to be fit to the $(x_i, y_i)$’s has the form y = bx. Use the 
least squares criterion to show that the “best” slope in that 
case is given by 

$$b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$

11.2.15. One of the most startling scientific discover- 
ies of the twentieth century was the announcement in 
1929 by the American astronomer Edwin Hubble that 
the universe is expanding. If v is a galaxy’s recession 
velocity (relative to that of any other galaxy) and d is 
its distance (from that same galaxy), Hubble’s law states 
that 


v=Hd 


where H is known as Hubble’s constant. (To cosmologists, 
Hubble’s constant is a critically important number—its 
reciprocal, after being properly scaled, is an estimate of 
the age of the universe.) The following are distance and 
velocity measurements made on eleven galactic clusters 
(23). Use the formula cited in Question 11.2.14 to estimate 
Hubble’s constant. 
                       Distance           Velocity 
                     (millions of        (thousands 
Cluster              light-years)       of miles/sec) 
Virgo 22 0.75 
Pegasus 68 2.4 
Perseus 108 3.2 
Coma Berenices 137 4.7 
Ursa Major No. 1 255 9.3 
Leo 315 12.0 
Corona Borealis 390 13.4 
Gemini 405 14.4 
Bootes 685 24.5 
Ursa Major No. 2 700 26.0 
Hydra 1100 38.0 


11.2.16. Given a set of n linearly related points, 
$(x_1, y_1), (x_2, y_2), \ldots,$ and $(x_n, y_n)$, use the least squares criterion 
to find formulas for 


(a) a if the slope of the xy-relationship is known to 
be b*. 

(b) b if the y-intercept of the xy-relationship is known 
to be a*. 


11.2.17. Among the problems faced by job seekers want- 
ing to reenter the workforce, eroded skills and outdated 
backgrounds are two of the most difficult to overcome. 
Knowing that, employers are often wary of hiring indi- 
viduals who have spent lengthy periods of time away 
from the job. The following table shows the percent- 
ages of hospitals willing to rehire medical technicians 
who have been away from that career for x years (145). 
It can be argued that the fitted line should necessarily 
have a y-intercept of 100 because no employer would 
refuse to hire someone (due to outdated skills) whose 
career had not been interrupted at all—that is, applicants 
for whom x = 0. Under that assumption, use the result 
from Question 11.2.16 to fit these data with the model 
y = 100 + bx. 


  Years of         Percent of Hospitals 
Inactivity, x      Willing to Hire, y 
0.5 100 
1.5 94 
4 75 
8 44 
13 28 
18 17 
11.2.18. A graph of the luxury suite data in Question 
8.2.5 suggests that the xy-relationship is linear. Moreover, 
it makes sense to constrain the fitted line to go through 
the origin, since x = 0 suites will necessarily produce y = 0 
revenue. 

(a) Find the equation of the least squares line, y = bx. 
(Hint: Recall Question 11.2.14.) 
(b) How much revenue would 120 suites be expected to 
generate? 


11.2.19. Set up (but do not solve) the equations necessary 
to determine the least squares estimates for the 
trigonometric model, 

y = a + bx + c sin x 

Assume that the data consist of the random sample 
$(x_1, y_1), (x_2, y_2), \ldots,$ and $(x_n, y_n)$. 


Nonlinear Models 


Obviously, not all xy-relationships can be adequately described by straight lines. 
Curvilinear relationships of all sorts can be found in every field of endeavor. Many 
of these nonlinear models, though, can still be fit using Theorem 11.2.1, provided the 
data have been initially “linearized” by a suitable transformation. 


Exponential Regression Suppose the relationship between two variables is best 
described by an exponential function of the form 

$$y = ae^{bx} \qquad (11.2.1)$$

Depending on the value of b, Equation 11.2.1 will look like one of the graphs 
pictured in Figure 11.2.7. Those curvilinear shapes notwithstanding, though, there is 
a linear model also related to Equation 11.2.1. 


[Figure 11.2.7: graphs of $y = ae^{bx}$, including the case b > 0] 


If $y = ae^{bx}$, it is necessarily true that

\[ \ln y = \ln a + bx \qquad (11.2.2) \]

which implies that ln y and x have a linear relationship. That being the case,
the formulas of Theorem 11.2.1 applied to x and ln y should yield the slope and
y-intercept of Equation 11.2.2.

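Specifically,

\[ b = \frac{n \sum_{i=1}^{n} x_i \ln y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} \ln y_i\right)}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \]

and

\[ \ln a = \frac{\sum_{i=1}^{n} \ln y_i - b \sum_{i=1}^{n} x_i}{n} \]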

Comment Transformations that induce linearity often require that the slope and/or 
y-intercept of the transformed model be transformed “back” to the original 
model. Here, for example, Theorem 11.2.1 leads to a formula for Ina, which means 
that the constant a appearing in the original exponential model is evaluated by 
calculating $e^{\ln a}$.


Case Study 11.2.4 


Beginning in the 1970s, computers have steadily decreased in size as they have 
grown in power. The ability to have more computing potential in a four-pound 
laptop than in a mainframe of the 1970s is a result of engineers squeezing more 
and more transistors onto silicon chips. The rate at which this miniaturization 
occurs is known as Moore’s law, after Gordon Moore, one of the founders of 
Intel Corporation. His prediction, first articulated in 1965, was that the number 
of transistors per chip would double every eighteen months. 

Table 11.2.5 lists some of the growth benchmarks—namely, the number of 
transistors per chip—associated with the Intel chips marketed over the twenty- 
year period from 1975 through 1995. Based on these figures, is it believable that 
chip capacity is, in fact, doubling at a fixed rate (meaning that Equation 11.2.1 
applies)? And if so, how close is the actual doubling time to Moore’s prediction 
of eighteen months? 

A plot of y versus x shows that their relationship is certainly not linear (see
Figure 11.2.8). The scatterplot more closely resembles the graph of $y = ae^{bx}$
when b > 0, as shown in Figure 11.2.7.


Table 11.2.5

Chip          Year   Years after 1975, x   Transistors per Chip, y
8080          1975    0                         4,500
8086          1978    3                        29,000
80286         1982    7                        90,000
80386         1985   10                       229,000
80486         1989   14                     1,200,000
Pentium       1993   18                     3,100,000
Pentium Pro   1995   20                     5,500,000

Source: en.wikipedia.org/wiki/Transistor_count.


Table 11.2.6 shows the calculation of the sums required to evaluate
the formulas for b and ln a.

[Figure 11.2.8: the data of Table 11.2.5 plotted against years after 1975,
with the fitted curve $y = 7247.189e^{0.343x}$]

Table 11.2.6

Years after 1975, $x_i$   $x_i^2$   Transistors per Chip, $y_i$   $\ln y_i$   $x_i \cdot \ln y_i$
 0                            0          4,500                     8.41183       0
 3                            9         29,000                    10.27505      30.82515
 7                           49         90,000                    11.40756      79.85292
10                          100        229,000                    12.34148     123.41480
14                          196      1,200,000                    13.99783     195.96962
18                          324      3,100,000                    14.94691     269.04438
20                          400      5,500,000                    15.52026     310.40520
72                         1078                                   86.90093    1009.51207

Here the slope and the y-intercept of the linearized model (Equation 11.2.2)
are 0.342810 and 8.888369, respectively:

\[ b = \frac{7(1009.51207) - 72(86.90093)}{7(1078) - (72)^2} = 0.342810 \]

and

\[ \ln a = \frac{86.90093 - (0.342810)(72)}{7} = 8.888369 \]

Therefore,

\[ a = e^{\ln a} = e^{8.888369} = 7247.189 \]

which implies that the best-fitting exponential model describing Intel’s
technological advances in chip design has the equation

\[ y = 7247.189e^{0.343x} \]

(see Figure 11.2.8).

To compare Equation 11.2.1 to Moore’s “eighteen-month doubling time”
prediction requires that we write $y = 7247.189e^{0.343x}$ in the form
$y = 7247.189(2)^{cx}$ for some constant c. But

\[ e^{0.343} = 2^{0.495} \]

so another way to express the fitted curve would be

\[ y = 7247.189\left(2^{0.495x}\right) \qquad (11.2.3) \]

In Equation 11.2.3, though, y doubles when $2^{0.495x} = 2$, or, equivalently,
when 0.495x = 1, which implies that 2.0 years is the empirically determined
technology doubling time, a pace not too much slower than Moore’s prediction
of eighteen months.
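
The arithmetic of Case Study 11.2.4 can be checked with a few lines of Python. The sketch below is only an illustration: the function name `fit_exponential` is hypothetical, and the code simply applies the least squares formulas to the pairs $(x_i, \ln y_i)$ from Table 11.2.5 and then converts the fitted slope into a doubling time.

```python
import math

# Table 11.2.5: years after 1975 (x) and transistors per chip (y).
years = [0, 3, 7, 10, 14, 18, 20]
transistors = [4_500, 29_000, 90_000, 229_000, 1_200_000, 3_100_000, 5_500_000]


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares applied to (x, ln y).

    Hypothetical helper written for this illustration.
    """
    n = len(xs)
    log_ys = [math.log(y) for y in ys]
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    sum_ln_y = sum(log_ys)
    sum_x_ln_y = sum(x * ly for x, ly in zip(xs, log_ys))
    # Slope and intercept of the linearized model ln y = ln a + b x.
    b = (n * sum_x_ln_y - sum_x * sum_ln_y) / (n * sum_x2 - sum_x ** 2)
    ln_a = (sum_ln_y - b * sum_x) / n
    # Transform the intercept back to the original scale.
    return math.exp(ln_a), b


a, b = fit_exponential(years, transistors)
doubling_time = math.log(2) / b  # solves exp(b * x) = 2 for x
print(f"a = {a:.1f}, b = {b:.4f}, doubling time = {doubling_time:.2f} years")
```

Up to rounding, the printed values should agree with the slope, the back-transformed intercept, and the roughly two-year doubling time reported in the case study.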


About the Data In April of 2005, Gordon Moore pronounced his law dead. He 
said, “It can’t continue forever. The nature of exponentials is that you push them out 
and eventually disaster happens.” If by “disaster” he meant that technology often 
makes a quantum leap, moving well beyond what an extrapolated law could predict, 
he was quite correct. Indeed, he could have made this declaration in 2003. By that 
year, the Itanium 2 featured 220,000,000 transistors on a chip, whereas the model of 
the case study predicts the number to be only 


\[ y = 7247.189e^{0.343(28)} = 107{,}432{,}032 \]

(In the equation, x = 2003 - 1975 = 28.)


Logarithmic Regression Another frequently encountered curvilinear model that 
can be easily linearized is the equation 


\[ y = ax^b \qquad (11.2.4) \]

Taking the common log of both sides of Equation 11.2.4 gives

\[ \log y = \log a + b \log x \]

which implies that log y is linear with log x. Therefore,

\[ b = \frac{n \sum_{i=1}^{n} \log x_i \cdot \log y_i - \left(\sum_{i=1}^{n} \log x_i\right)\left(\sum_{i=1}^{n} \log y_i\right)}{n \sum_{i=1}^{n} (\log x_i)^2 - \left(\sum_{i=1}^{n} \log x_i\right)^2} \]

and

\[ \log a = \frac{\sum_{i=1}^{n} \log y_i - b \sum_{i=1}^{n} \log x_i}{n} \]

Regressions of this type have slower growth rates than exponential models
and are particularly useful in describing biological and engineering
phenomena.


Case Study 11.2.5 


Among mammals, the relationship between the age at which an animal devel- 
ops locomotion and the age at which it first begins to play has been widely 
studied. Table 11.2.7 lists “onset” times for locomotion and for play in eleven 
different species (41). Graphed, the data show a pattern that suggests that
$y = ax^b$ would be a good function for modeling the xy-relationship (see
Figure 11.2.9).


Table 11.2.7

                      Locomotion           Play Begins,
Species               Begins, x (days)     y (days)
Homo sapiens          360                   90
Gorilla gorilla       165                  105
Felis catus            21                   21
Canis familiaris       23                   26
Rattus norvegicus      11                   14
Turdus merula          18                   28
Macaca mulatta         18                   21
Pan troglodytes       150                  105
Saimiri sciureus       45                   68
Cercocebus alb.        45                   75
Tamiasciureus hud.     18                   46

[Figure 11.2.9: play begins (days) versus locomotion begins (days), with the
fitted curve $y = 5.42x^{0.56}$]

Table 11.2.8

$x_i$   $\log x_i$   $y_i$   $\log y_i$   $(\log x_i)^2$   $\log x_i \cdot \log y_i$
360      2.55630      90      1.95424      6.53467          4.99562
165      2.21748     105      2.02119      4.91722          4.48195
 21      1.32222      21      1.32222      1.74827          1.74827
 23      1.36173      26      1.41497      1.85431          1.92681
 11      1.04139      14      1.14613      1.08449          1.19357
 18      1.25527      28      1.44716      1.57570          1.81658
 18      1.25527      21      1.32222      1.57570          1.65974
150      2.17609     105      2.02119      4.73537          4.39829
 45      1.65321      68      1.83251      2.73310          3.02952
 45      1.65321      75      1.87506      2.73310          3.09987
 18      1.25527      46      1.66276      1.57570          2.08721
        17.74744             18.01965     31.06763         30.43743


The sums and sums of squares necessary to find a and b are calculated
in Table 11.2.8. Substituting into the formulas for b and log a given earlier
for the slope and y-intercept of the linearized model gives

\[ b = \frac{11(30.43743) - (17.74744)(18.01965)}{11(31.06763) - (17.74744)^2} = 0.56 \]

and

\[ \log a = \frac{18.01965 - (0.56)(17.74744)}{11} = 0.73364 \]

Therefore, $a = 10^{0.73364} = 5.42$, and the equation describing the
xy-relationship is $y = 5.42x^{0.56}$ (see Figure 11.2.9).
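
As a check on Case Study 11.2.5, the same fit can be sketched in Python. The helper name `fit_power` is hypothetical; the code applies the least squares formulas to the pairs $(\log x_i, \log y_i)$ built from Table 11.2.7, using common logarithms as in the text.

```python
import math

# Table 11.2.7: onset of locomotion (x, days) and onset of play (y, days).
locomotion = [360, 165, 21, 23, 11, 18, 18, 150, 45, 45, 18]
play = [90, 105, 21, 26, 14, 28, 21, 105, 68, 75, 46]


def fit_power(xs, ys):
    """Fit y = a * x**b by least squares applied to (log10 x, log10 y).

    Hypothetical helper written for this illustration.
    """
    n = len(xs)
    log_xs = [math.log10(x) for x in xs]
    log_ys = [math.log10(y) for y in ys]
    sum_lx = sum(log_xs)
    sum_ly = sum(log_ys)
    sum_lx2 = sum(lx * lx for lx in log_xs)
    sum_lxly = sum(lx * ly for lx, ly in zip(log_xs, log_ys))
    b = (n * sum_lxly - sum_lx * sum_ly) / (n * sum_lx2 - sum_lx ** 2)
    log_a = (sum_ly - b * sum_lx) / n
    # Undo the common-log transformation of the intercept.
    return 10 ** log_a, b


a, b = fit_power(locomotion, play)
print(f"a = {a:.2f}, b = {b:.2f}")
```

The result should match the fitted curve of the case study to the two decimal places reported there.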


Logistic Regression Growth is a fundamental characteristic of organisms, institu- 
tions, and ideas. In biology, it might refer to the change in size of a Drosophila 
population; in economics, to the proliferation of global markets; in political science, 
to the gradual acceptance of tax reform. Prominent among the many growth models 
capable of describing situations of this sort is the logistic equation 


\[ y = \frac{L}{1 + e^{a+bx}} \qquad (11.2.5) \]

where a, b, and L are constants. For different values of a and b,
Equation 11.2.5 generates a variety of S-shaped curves.

To linearize Equation 11.2.5, we start with its reciprocal:

\[ \frac{1}{y} = \frac{1 + e^{a+bx}}{L} \]

Therefore,

\[ \frac{L}{y} = 1 + e^{a+bx} \]

and

\[ \frac{L - y}{y} = e^{a+bx} \]

Equivalently,

\[ \ln\left(\frac{L - y}{y}\right) = a + bx \]

which implies that $\ln\left(\frac{L - y}{y}\right)$ is linear with x.


Comment The parameter L is interpreted as the limit to which y is converging as 
x increases. In practice, L is often estimated simply by plotting the data and “eye- 
balling” the y-asymptote. 


Case Study 11.2.6 


Biological organisms often exhibit exponential growth. However, in some cases, 
that rapid rate of growth cannot be sustained. Such factors as lack of nutrition 
to support a large population or the buildup of toxins limit the rate of growth. 
In such cases the curve begins concave up, inflects at some point, and becomes 
concave down and asymptotic to a limit. 

A now-classical experiment provides data with the above characteristics. 
Carlson (20) measured the amount of biomass of brewer’s yeast (Saccha- 
romyces Cerevisiae) at one-hour intervals. Table 11.2.9 shows the results. 


Table 11.2.9 

Hour   Yeast Count   Hour   Yeast Count
0 9.6 9 441.0 
1 18.3 10 513.3 
2 29.0 11 559.7 
3 47.2 12 594.8 
4 71.1 13 629.4 
5 119.1 14 640.8 
6 174.6 15 651.1 
7 257.3 16 655.9 
8 350.7 17 659.6 


The scatterplot for these eighteen data points has a definite S-shaped
appearance (see Figure 11.2.10), which makes Equation 11.2.5 a good candidate
for modeling the xy-relationship. The limit to which the population is
converging appears to be about 700. Quantify the population/time relationship
by fitting a logistic equation to these data. Let L = 700.

[Figure 11.2.10: population versus hours, for hours 0 through 18]

The form of the linearized version of Equation 11.2.5 requires that we find
the following sums:

\[ \sum_{i=1}^{18} x_i = 153, \quad \sum_{i=1}^{18} \ln\left(\frac{700 - y_i}{y_i}\right) = 1.75603, \quad \sum_{i=1}^{18} x_i^2 = 1785, \quad \text{and} \quad \sum_{i=1}^{18} x_i \cdot \ln\left(\frac{700 - y_i}{y_i}\right) = -197.40071 \]

Substituting $\ln\left(\frac{700 - y_i}{y_i}\right)$ for $y_i$ into the formulas
for a and b in Theorem 11.2.1 gives

\[ b = \frac{18(-197.40071) - (153)(1.75603)}{18(1785) - (153)^2} = -0.4382 \]

and

\[ a = \frac{1.75603 - (-0.4382)(153)}{18} = 3.822 \]

so the best-fitting logistic curve has equation

\[ y = \frac{700}{1 + e^{3.822 - 0.4382x}} \]
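
The logistic fit of Case Study 11.2.6 can be reproduced along the same lines. In the sketch below, `fit_logistic` is a hypothetical helper, and L = 700 is the "eyeballed" asymptote assumed in the case study; the code regresses $\ln[(L - y_i)/y_i]$ on $x_i$ using the yeast counts of Table 11.2.9.

```python
import math

# Table 11.2.9: hour (x) and yeast count (y).
hours = list(range(18))
yeast = [9.6, 18.3, 29.0, 47.2, 71.1, 119.1, 174.6, 257.3, 350.7,
         441.0, 513.3, 559.7, 594.8, 629.4, 640.8, 651.1, 655.9, 659.6]


def fit_logistic(xs, ys, limit):
    """Fit y = limit / (1 + exp(a + b*x)) via the linearizing transformation.

    Hypothetical helper; `limit` (L in the text) must exceed every y value.
    """
    n = len(xs)
    zs = [math.log((limit - y) / y) for y in ys]  # transformed responses
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    sum_z = sum(zs)
    sum_xz = sum(x * z for x, z in zip(xs, zs))
    b = (n * sum_xz - sum_x * sum_z) / (n * sum_x2 - sum_x ** 2)
    a = (sum_z - b * sum_x) / n
    return a, b


a, b = fit_logistic(hours, yeast, limit=700)
print(f"a = {a:.3f}, b = {b:.4f}")
```

Because L is fixed in advance rather than estimated, a different choice of asymptote would change both a and b.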


Other Curvilinear Models While the exponential, logarithmic, and logistic
equations are three of the most common curvilinear models, there are several
others that deserve mention as well. Table 11.2.10 lists a total of six
nonlinear equations, including the three already described. Along with each is
the particular transformation that reduces the equation to a linear form.
Proofs for parts (d), (e), and (f) will be left as exercises.

Table 11.2.10

a. If $y = ae^{bx}$, then ln y is linear with x.
b. If $y = ax^b$, then log y is linear with log x.
c. If $y = L/(1 + e^{a+bx})$, then $\ln\left(\frac{L - y}{y}\right)$ is linear with x.
d. If $y = \frac{1}{a + bx}$, then $\frac{1}{y}$ is linear with x.
e. If $y = \frac{x}{a + bx}$, then $\frac{1}{y}$ is linear with $\frac{1}{x}$.
f. If $y = 1 - e^{-x^b/a}$, then $\ln \ln\left(\frac{1}{1 - y}\right)$ is linear with ln x.


Questions


11.2.20. Radioactive gold ($^{195}$Au-aurothiomalate) has an
affinity for inflamed tissues and is sometimes used as a
tracer to diagnose arthritis. The data in the following table
(62) come from an experiment investigating the length
of time and the concentrations that $^{195}$Au-aurothiomalate
is retained in a person’s blood. Listed are the serum
gold concentrations found in ten blood samples taken
from patients given an initial dose of 50 mg. Follow-up
readings were made at various times, ranging from one to
seven days after injection. In each case, the retention is
expressed as a percentage of the patient’s day-zero serum
gold concentration.

Days after Injection, x   Serum Gold % Concentration, y
                          94.5
                          86.4
[remaining table entries are not recoverable]

(a) Fit an exponential curve to these data.

(b) Estimate the half-life of $^{195}$Au-aurothiomalate; that
is, how long does it take for half the gold to disappear
from a person’s blood?

If x denotes days after injection and y denotes serum gold % concentration,
then $\sum_{i=1}^{10} x_i = 35$, $\sum_{i=1}^{10} x_i^2 = 169$,
$\sum_{i=1}^{10} \ln y_i = 41.35720$, and $\sum_{i=1}^{10} x_i \ln y_i = 137.97415$.

11.2.21. The growth of the federal debt is one of the char- 
acteristic features of the U.S. economy. The rapidity of the 
increases from 1996 to 2006, as shown in the table below, 
suggests an exponential model. 


Gross Federal Debt 
Year Years after 1995, x (in $ trillions), y 
1996 1 5.181 
1997 2 5.396 
1998 3 5.478 
1999 4 5.606 
2000 5 5.629 
2001 6 5.770 
2002 7 6.198 
2003 8 6.760 
2004 9 7.355 
2005 10 7.905 
2006 11 8.451 


Source: whitehouse.gov/omb/budget/fy2008/pdf/hist.pdf. 


(a) Find the best-fitting exponential curve, using the
method of least squares together with an appropriate
linearizing transformation. Use the sums:

\[ \sum_{i=1}^{11} \ln y_i = 20.16825 \quad \text{and} \quad \sum_{i=1}^{11} x_i \cdot \ln y_i = 126.33786 \]

(b) The official Office of Management and Budget pre- 
diction for 2007 was $9 trillion. Compare this figure 
to the projection using the model from part (a). 

(c) Even though the model of part (a) is considered 
“good” by a criterion to be given in Section 11.4 
(r squared), plot the residuals and consider what they 
say about the exponential fit. 


11.2.22. Used cars are often sold wholesale at auctions, 
and from these sales, retail sales prices are recommended. 


The following table gives the recommended prices in 2009 
for a four-door manual transmission Toyota Corolla based 
on the age of the car. 


Age (in years), x   Suggested retail price, y
 1                  $14,680
 2                   12,150
 3                   11,215
 4                   10,180
 5                    9,230
 6                    8,455
 7                    7,730
 8                    6,825
 9                    6,135
10                    5,620

Source: www.bb.com.

(a) Fit these data with a model of the form $y = ae^{bx}$.
Graph the $(x_i, y_i)$’s and superimpose the least
squares exponential curve.

(b) What would you predict the retail price of an eleven- 
year-old Toyota Corolla to be? 

(c) The price of a new Corolla in 2009 was $16,200. Is 
that figure consistent with the widely held belief that 
a new car depreciates substantially the moment it is 
purchased? Explain. 


11.2.23. The stock market showed steady and significant 
growth during the period from 1981 to 2000. This growth 
was reflected in the Dow Jones Industrial Average. The 
table gives the Dow Jones average (rounded to the near- 
est whole number) for the opening of the stock market in 
January for the years 1981 to 2000. 


Years after 1981, x   Dow Jones Industrial Average, y
 0                       947
 1                       871
 2                     1,076
 3                     1,221
 4                     1,287
 5                     1,571
 6                     2,158
 7                     1,958
 8                     2,342
 9                     2,591
10                     2,736
11                     3,223
12                     3,310
13                     3,978
14                     3,844
15                     5,395
16                     6,813
17                     7,907
18                     9,359
19                    10,941

Source: finance.yahoo.com/of/hp?s=%5EDJI.

Use the fact that $\sum_{i=1}^{20} \ln y_i = 158.58560$ and
$\sum_{i=1}^{20} x_i \cdot \ln y_i = 1591.99387$ to fit the data with an
exponential model.


11.2.24. Suppose a set of n $(x_i, y_i)$’s are measured on a
phenomenon whose theoretical xy-relationship is of the
form $y = ae^{bx}$.

(a) Show that $\frac{dy}{dx} = by$ implies that $y = ae^{bx}$.

(b) On what kind of graph paper would the $(x_i, y_i)$’s
show a linear relationship?


11.2.25. In 1959, the Ise Bay typhoon devastated parts of 
Japan. For seven metropolitan areas in the storm’s path, 
the following table gives the number of homes damaged 
as a function of peak wind gust (118). Show that a function
of the form $y = ax^b$ provides a good model for the data.

         Peak Wind Gust         Numbers of Damaged
City     (hundred mph), x       Homes (in thousands), y
A        0.98                    25.000
B        0.74                     0.950
C        1.12                   200.000
D        1.34                   150.000
E        0.87                     0.940
F        0.65                     0.090
G        1.39                   260.000

Use the following sums:

\[ \sum_{i=1}^{7} \log x_i = -0.067772 \qquad \sum_{i=1}^{7} \log y_i = 7.1951 \]

\[ \sum_{i=1}^{7} (\log x_i)^2 = 0.0948679 \qquad \sum_{i=1}^{7} (\log x_i)(\log y_i) = 0.92314 \]


11.2.26. Studies have shown that certain ants in a colony
are assigned foraging duties, which require them to come
and go from the colony on a regular basis. Furthermore,
if y is the colony size and x is the number of ants that
forage, the relationship between y and x has the form
$y = ax^b$, where a and b vary from species to species.
Once the parameter values have been estimated for a
particular kind of ant, biologists can count the (relatively
small) number of ants that forage and then use the regression
equation to estimate the (much larger) number of
ants living in the colony. The table below gives the
results of a “calibration” study done on the red wood ant
(Formica polyctena): Listed are the actual colony sizes,
y, and the foraging sizes, x, recorded for fifteen of their
colonies (94).

(a) Find a and b using the sums below.

(b) If the number of red wood ants seen foraging is 2500,
what would be a reasonable estimate for the size of
the colony from which they came?

\[ \sum_{i=1}^{15} \log x_i = 41.77441 \qquad \sum_{i=1}^{15} \log y_i = 52.79857 \]

\[ \sum_{i=1}^{15} (\log x_i)^2 = 126.60450 \qquad \sum_{i=1}^{15} \log x_i \cdot \log y_i = 156.03811 \]

Foraging Size, x   Colony Size, y
    45                 280
    74                 222
   118                 288
    70                 601
   220               1,205
   823               2,769
   647               2,828
   446               3,229
   765               3,762
   338               7,551
   611               8,834
 4,119              12,584
   850              12,605
11,600              34,661
64,512             139,043

11.2.27. Over the years, many efforts have been made to 
demonstrate that the human brain is appreciably different 
in structure from the brains of lower-order primates. In 
point of fact, such differences in gross anatomy are discon- 
certingly difficult to discern. The following are the average 
areas of the striate cortex (x) and the prestriate cortex (y) 
found for humans and for three species of chimpanzees 
(129). 


                              Area
Primate          Striate Cortex, x (mm²)   Prestriate Cortex, y (mm²)
Homo             2613                      7838
Pongo            1876                      2864
Cercopithecus     933                      1334
Galago           78.9                      40.8

Plot the data and superimpose the least squares curve, $y = ax^b$.


11.2.28. Years of experience buying and selling commer- 
cial real estate have convinced many investors that the 
value of land zoned for business (y) is inversely related 


to its distance (x) from the center of town; that is,
$y = a + b \cdot \frac{1}{x}$. If that suspicion is correct, what should be the


appraised value of a piece of property located ; mile from 
the town square, based on the sales listed below? 


Land      Distance from Center of City     Value
Parcel    (in thousand feet), x            (in thousands), y
H1         1.00                            $20.5
B6         0.50                             42.7
Q4         0.25                             80.4
L4         2.00                             10.5
T7         4.00                              6.1
D9         6.00                              6.0
E4        10.00                              3.5

11.2.29. Verify the claims made in parts (d), (e), and (f) 
of Table 11.2.10—that is, prove that the transformations 
cited will linearize the original models. 


11.2.30. During the 1960s, when the Cold War was fuel- 
ing an arms race between the Soviet Union and the United 
States, the number of American intercontinental ballis- 
tic missiles (ICBMs) rose from 18 to 1054 (9). Moreover, 
the sizes of the ICBM stockpile during that decade had 
an S-shaped pattern, suggesting that the logistic curve 
would provide a good model. Graph the following data,
and approximate the xy-relationship with the function
$y = \frac{L}{1 + e^{a+bx}}$. Assume that L = 1055.

Year   Years after 1959, x   Number of ICBMs, y
1960    1                      18
1961    2                      63
1962    3                     294
1963    4                     424
1964    5                     834
1965    6                     854
1966    7                     904
1967    8                    1054
1968    9                    1054
1969   10                    1054

11.2.31. The following table shows a portion of the results
from a clinical trial investigating the effectiveness of a
monoamine oxidase inhibitor as a treatment for depression
(207). The relationship between y, the percentage of
subjects showing improvement, and x, the patient’s age,
appears to be S-shaped. Graph the data and superimpose
a graph of the least squares curve $y = \frac{L}{1 + e^{a+bx}}$. Take L to
be 60.

Age Group   Age Mid-Point, x   % Improved, y   $\ln\left(\frac{60 - y}{y}\right)$
[28, 32)    30                 11               1.49393
[32, 36)    34                 14               1.18958
[36, 40)    38                 19               0.76913
[40, 44)    42                 32              -0.13353
[44, 48)    46                 42              -0.84730
[48, 52)    50                 48              -1.38629
[52, 56)    54                 50              -1.60944
[56, 60)    58                 52              -1.87180

11.3 The Linear Model


Section 11.2 views the problem of “curve fitting” from a purely geometrical
perspective. The observed $(x_i, y_i)$’s are assumed to be nothing more than
points in the xy-plane, devoid of any statistical properties. It is more
realistic, though, to think of each y as the value recorded for a random
variable Y, meaning that a distribution of possible y-values is associated
with every value of x.

Consider, for example, the connecting rod weights analyzed in Case
Study 11.2.1. The first rod listed in Table 11.2.1 had an initial weight of
x = 2.745 oz. and, after the tooling process was completed, a finished weight
of y = 2.080 oz. It does not follow from that one observation, of course, that
an initial weight of 2.745 oz. necessarily leads to a finished weight of
2.080 oz. Common sense tells us that the tooling process will not always have
exactly the same effect, even on rods having the same initial weight.
Associated with each x, then, there will be a range of possible y-values. The
symbol $f_{Y|x}(y)$ is used to denote the pdfs of these “conditional”
distributions.


Definition 11.3.1. Let $f_{Y|x}(y)$ denote the pdf of the random variable Y for
a given value x, and let $E(Y \mid x)$ denote the expected value associated
with $f_{Y|x}(y)$. The function

\[ y = E(Y \mid x) \]

is called the regression curve of Y on x.


Example 11.3.1. Suppose that corresponding to each value of x in the interval
$0 \le x \le 1$ is a distribution of y-values having the pdf

\[ f_{Y|x}(y) = \frac{x + y}{x + \frac{1}{2}}, \quad 0 \le y \le 1; \; 0 \le x \le 1 \]

Find and graph the regression curve of Y on x.

Notice, first of all, that for any x between 0 and 1, $f_{Y|x}(y)$ does qualify
as a pdf:

1. $f_{Y|x}(y) \ge 0$, for $0 \le y \le 1$ and any $0 \le x \le 1$

2. $\int_0^1 f_{Y|x}(y)\,dy = \int_0^1 \frac{x + y}{x + \frac{1}{2}}\,dy = 1$

Moreover,

\[ E(Y \mid x) = \int_0^1 y \cdot f_{Y|x}(y)\,dy = \int_0^1 y \cdot \frac{x + y}{x + \frac{1}{2}}\,dy \]

\[ = \left[ \frac{xy^2}{2\left(x + \frac{1}{2}\right)} + \frac{y^3}{3\left(x + \frac{1}{2}\right)} \right]_0^1 = \frac{3x + 2}{6x + 3}, \quad 0 \le x \le 1 \]

[Figure 11.3.1: the regression curve $E(Y \mid x) = \frac{3x + 2}{6x + 3}$ with
the conditional pdfs at x = 0, 1/2, and 1]

Figure 11.3.1 shows the regression curve, $y = E(Y \mid x) = \frac{3x + 2}{6x + 3}$,
together with three of the conditional distributions: $f_{Y|0}(y) = 2y$,
$f_{Y|\frac{1}{2}}(y) = y + \frac{1}{2}$, and $f_{Y|1}(y) = \frac{2y + 2}{3}$.
The $f_{Y|x}(y)$’s, of course, should be visualized as coming out of the plane
of the paper.
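
The closed form for the regression curve in Example 11.3.1 can be checked numerically. The following Python sketch is an illustration only: it approximates the two integrals with a midpoint rule (a choice made here for simplicity, with hypothetical function names) and compares the conditional mean with $(3x + 2)/(6x + 3)$ at x = 0, 1/2, and 1.

```python
def conditional_pdf(y, x):
    """The conditional pdf of Example 11.3.1, valid for 0 <= y <= 1."""
    return (x + y) / (x + 0.5)


def integrate(func, lower, upper, steps=10_000):
    """Midpoint-rule approximation to the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (k + 0.5) * width) for k in range(steps))


def regression_curve(x):
    """Closed form derived in the example: E(Y | x) = (3x + 2) / (6x + 3)."""
    return (3 * x + 2) / (6 * x + 3)


for x in (0.0, 0.5, 1.0):
    total_mass = integrate(lambda y: conditional_pdf(y, x), 0.0, 1.0)
    mean = integrate(lambda y: y * conditional_pdf(y, x), 0.0, 1.0)
    print(f"x = {x}: mass = {total_mass:.6f}, "
          f"numerical mean = {mean:.6f}, closed form = {regression_curve(x):.6f}")
```

For each x, the total mass should be essentially 1 and the two values of the mean should agree to several decimal places.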


A Special Case 


Definition 11.3.1 introduces the notion of a regression curve in the most gen- 
eral of contexts. In practice, there is one special case of the function y = E(Y | x) 
that is particularly important. Known as the simple linear model, it makes four 
assumptions: 


1. $f_{Y|x}(y)$ is a normal pdf for all x.
2. The standard deviation, $\sigma$, associated with $f_{Y|x}(y)$ is the same for all x.
3. The means of all the conditional Y distributions are collinear; that is,

\[ y = E(Y \mid x) = \beta_0 + \beta_1 x \]

4. All of the conditional distributions represent independent random variables.
(See Figure 11.3.2.)


[Figure 11.3.2: normal conditional pdfs $f_{Y|x}(y)$ at several values of x,
with means on the line $y = E(Y \mid x) = \beta_0 + \beta_1 x$]


Estimating the Linear Model Parameters


Implicit in the simple linear model are three parameters: $\beta_0$, $\beta_1$,
and $\sigma^2$. Typically, all three will be unknown and need to be estimated.
Since the model assumes a probability structure for the Y-variable, estimates
can be obtained using the method of maximum likelihood, as opposed to the
method of least squares that we saw in Section 11.2. (Maximum likelihood
estimates are preferable to least squares estimates because the former have
probability distributions that can be used to set up hypothesis tests and
confidence intervals.)


Comment It would be entirely consistent with the notation used previously to
denote the sample in Theorem 11.3.1 as $(x_1, y_1), (x_2, y_2), \ldots,$ and
$(x_n, y_n)$. To emphasize the important distinction, though, between the
(lack of) assumptions on the $y_i$’s made in Section 11.2 and the conditional
pdfs $f_{Y|x}(y)$ introduced in Definition 11.3.1, we will use random variable
notation to write linear model data as $(x_1, Y_1), (x_2, Y_2), \ldots,$ and
$(x_n, Y_n)$.


Theorem 11.3.1. Let $(x_1, Y_1), (x_2, Y_2), \ldots,$ and $(x_n, Y_n)$ be a set
of points satisfying the simple linear model, $E(Y \mid x) = \beta_0 + \beta_1 x$.
The maximum likelihood estimators for $\beta_0$, $\beta_1$, and $\sigma^2$ are
given by

\[ \hat{\beta}_1 = \frac{n \sum_{i=1}^{n} x_i Y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n \left(\sum_{i=1}^{n} x_i^2\right) - \left(\sum_{i=1}^{n} x_i\right)^2} \]

\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x} \]

and

\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]

where $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, $i = 1, \ldots, n$.


Proof Since each $Y_i$ is assumed to be normally distributed with mean equal to
$\beta_0 + \beta_1 x_i$ and variance equal to $\sigma^2$, the sample’s
likelihood function, L, is the product

\[ L = \prod_{i=1}^{n} f_{Y_i|x_i}(y_i) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y_i - \beta_0 - \beta_1 x_i}{\sigma}\right)^2} \]

The maximum of L occurs when the partial derivatives with respect to $\beta_0$,
$\beta_1$, and $\sigma^2$ all vanish.

It will be easier, computationally, to differentiate $-2 \ln L$, and the latter
will be minimized for the same parameter values that maximize L. Here,

\[ -2 \ln L = n \cdot \ln(2\pi) + n \cdot \ln(\sigma^2) + \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]

Setting the three partial derivatives equal to 0 gives

\[ \frac{\partial(-2 \ln L)}{\partial \beta_0} = \frac{2}{\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)(-1) = 0 \]

\[ \frac{\partial(-2 \ln L)}{\partial \beta_1} = \frac{2}{\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0 \]

\[ \frac{\partial(-2 \ln L)}{\partial \sigma^2} = \frac{n}{\sigma^2} - \frac{1}{(\sigma^2)^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = 0 \]

The first two equations depend only on $\beta_0$ and $\beta_1$, and the
resulting solutions for $\hat{\beta}_0$ and $\hat{\beta}_1$ have the same forms
that are given in the statement of the theorem. Substituting the solutions from
the first two equations into the third gives the expression for $\hat{\sigma}^2$.
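
The last two steps of the proof can be written out explicitly. The display below is a sketch of the algebra, not a quotation from the text: the first two likelihood equations rearrange into the familiar normal equations, and the third, with the solutions $\hat{\beta}_0$ and $\hat{\beta}_1$ substituted, is solved for $\sigma^2$.

```latex
\begin{align*}
\sum_{i=1}^{n} y_i &= n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i
  && \text{(from } \partial/\partial\beta_0 = 0\text{)} \\
\sum_{i=1}^{n} x_i y_i &= \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2
  && \text{(from } \partial/\partial\beta_1 = 0\text{)} \\[6pt]
\frac{n}{\hat{\sigma}^2}
  - \frac{1}{(\hat{\sigma}^2)^2} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 &= 0
  && \text{(from } \partial/\partial\sigma^2 = 0\text{)} \\
n\hat{\sigma}^2 &= \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2
  && \text{(multiply by } (\hat{\sigma}^2)^2\text{)} \\
\hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2
\end{align*}
```

Solving the two normal equations simultaneously for $\beta_0$ and $\beta_1$ yields the expressions stated in Theorem 11.3.1.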


Comment Note the similarity in the formulas for the maximum likelihood
estimators and the least squares estimates for $\beta_0$ and $\beta_1$. The
least squares estimates, of course, are numbers, while the maximum likelihood
estimators are random variables.

Up to this point, random variables have been denoted with uppercase letters
and their values with lowercase letters. In this section, boldface
$\boldsymbol{\hat{\beta}}_0$ and $\boldsymbol{\hat{\beta}}_1$ will represent the
maximum likelihood random variables, and plain-text $\hat{\beta}_0$ and
$\hat{\beta}_1$ will refer to specific values taken on by those random
variables.


Properties of Linear Model Estimators


By virtue of the assumptions that define the simple linear model, we know that
the estimators $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$ are random
variables. Before those estimators can be used to set up inference procedures,
though, we need to establish their basic statistical properties: specifically,
their means, variances, and pdfs.


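Theorem 11.3.2. Let $(x_1, Y_1), (x_2, Y_2), \ldots,$ and $(x_n, Y_n)$ be a set
of points satisfying the simple linear model, $E(Y \mid x) = \beta_0 + \beta_1 x$.
Let $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\sigma}^2$ be the maximum
likelihood estimators for $\beta_0$, $\beta_1$, and $\sigma^2$, respectively.
Then

a. $\hat{\beta}_0$ and $\hat{\beta}_1$ are both normally distributed.

b. $\hat{\beta}_0$ and $\hat{\beta}_1$ are both unbiased: $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$.

c. $\mathrm{Var}(\hat{\beta}_1) = \dfrac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

d. $\mathrm{Var}(\hat{\beta}_0) = \dfrac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2} = \sigma^2 \left[ \dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]$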


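Proof We will prove the statements for $\hat{\beta}_1$; the results for
$\hat{\beta}_0$ follow similarly.

The equation for the estimator $\hat{\beta}_1$ given in Theorem 11.3.1 is the
simplest form that solves the likelihood equations (and the least squares
equations as well). It is also convenient for computation. However, two other
expressions for $\hat{\beta}_1$ are useful for theoretical results.

To begin, take the version of $\hat{\beta}_1$ from Theorem 11.3.1:

\[ \hat{\beta}_1 = \frac{n \sum_{i=1}^{n} x_i Y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n \left(\sum_{i=1}^{n} x_i^2\right) - \left(\sum_{i=1}^{n} x_i\right)^2} \]

Dividing numerator and denominator by n gives

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i Y_i - \bar{x} \sum_{i=1}^{n} Y_i}{\left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) Y_i}{\left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2} \qquad (11.3.1) \]

Equation 11.3.1 expresses $\hat{\beta}_1$ as a linear combination of independent
normal variables, so by the second corollary to Theorem 4.3.3, it is itself
normal, proving part (a).

To see that $\hat{\beta}_1$ is unbiased, note that

\[ E(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) E(Y_i)}{\left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(\beta_0 + \beta_1 x_i)}{\left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2} = \frac{\beta_0 \sum_{i=1}^{n} (x_i - \bar{x}) + \beta_1 \sum_{i=1}^{n} (x_i - \bar{x}) x_i}{\left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2} \]

\[ = \frac{0 + \beta_1 \left[\left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2\right]}{\left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2} = \beta_1 \]

To find $\mathrm{Var}(\hat{\beta}_1)$, rewrite the denominator of
Equation 11.3.1 in the form

\[ \left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2 = \sum_{i=1}^{n} \left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) = \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

which makes

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) Y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (11.3.2) \]

Using Equation 11.3.2, Theorem 3.6.2, and the second corollary to Theorem 3.9.5
gives

\[ \mathrm{Var}(\hat{\beta}_1) = \mathrm{Var}\left[ \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sum_{i=1}^{n} (x_i - \bar{x}) Y_i \right] = \frac{1}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \sigma^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]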
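
Parts (b) and (c) of Theorem 11.3.2 can be illustrated by simulation. The Python sketch below is not part of the text: the design points, the parameter values, the seed, and the number of replications are all hypothetical choices made only for illustration. It repeatedly generates data from a simple linear model, computes the slope estimator in the form of Equation 11.3.2, and compares the sample mean and variance of the estimates with $\beta_1$ and $\sigma^2 / \sum (x_i - \bar{x})^2$.

```python
import random

random.seed(11)  # arbitrary seed, fixed only so that reruns give the same output

# Hypothetical design points and parameter values (not taken from the text).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
beta0, beta1, sigma = 3.0, 0.5, 2.0

n = len(xs)
x_bar = sum(xs) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)


def slope_estimate(ys):
    """Slope estimator written as sum((x_i - x_bar) * Y_i) / sum((x_i - x_bar)^2)."""
    return sum((x - x_bar) * y for x, y in zip(xs, ys)) / s_xx


replications = 20_000
slopes = []
for _ in range(replications):
    # Each Y_i is normal with mean beta0 + beta1 * x_i and standard deviation sigma.
    ys = [random.gauss(beta0 + beta1 * x, sigma) for x in xs]
    slopes.append(slope_estimate(ys))

mean_slope = sum(slopes) / replications
var_slope = sum((b - mean_slope) ** 2 for b in slopes) / (replications - 1)
theoretical_var = sigma ** 2 / s_xx

print(f"mean of slope estimates = {mean_slope:.4f} (beta1 = {beta1})")
print(f"variance of slope estimates = {var_slope:.5f} "
      f"(theoretical value = {theoretical_var:.5f})")
```

A simulation of this kind can only suggest, not prove, the theorem; the agreement improves as the number of replications grows.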
Theorem 11.3.3. Let $(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)$ satisfy the
assumptions of the simple linear model. Then

a. $\hat{\beta}_1$, $\bar{Y}$, and $\hat{\sigma}^2$ are mutually independent.

b. $\dfrac{n\hat{\sigma}^2}{\sigma^2}$ has a chi square distribution with n - 2 degrees of freedom.


Proof See Appendix 11.A.2.


Corollary Let $\hat{\sigma}^2$ be the maximum likelihood estimator for
$\sigma^2$ in a simple linear model. Then $\dfrac{n}{n - 2} \cdot \hat{\sigma}^2$
is an unbiased estimator for $\sigma^2$.

Proof Recall that the expected value of a $\chi^2_k$ distribution is k (see
Theorems 4.6.3 and 7.3.1). Therefore,

\[ E\left( \frac{n}{n - 2} \cdot \hat{\sigma}^2 \right) = \frac{\sigma^2}{n - 2}\, E\left( \frac{n\hat{\sigma}^2}{\sigma^2} \right) = \frac{\sigma^2}{n - 2} \cdot (n - 2) \quad \text{[by part (b) of Theorem 11.3.3]} \]

\[ = \sigma^2 \]


Corollary The random variables $\hat{Y}$ and $\hat{\sigma}^2$ are independent.


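Estimating $\sigma^2$


We know that the (biased) maximum likelihood estimator for $\sigma^2$ in a
simple linear model is

\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]

The unbiased estimator for $\sigma^2$ based on $\hat{\sigma}^2$ is denoted
$S^2$, where

\[ S^2 = \frac{n}{n - 2}\, \hat{\sigma}^2 = \frac{1}{n - 2} \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]

Statistical software packages, including Minitab, typically print out s, rather
than $\hat{\sigma}$, in summarizing the calculations associated with linear
model data. To accommodate that convention, we will use $s^2$ rather than
$\hat{\sigma}^2$ in writing the formulas for the test statistics and confidence
intervals that arise in connection with the simple linear model.


Comment Calculating $\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
can be cumbersome. Three (algebraically equivalent) computing formulas are
available that may be easier to use, depending on the data:

\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 - \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (11.3.3) \]

\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2 - \frac{\left[ \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right) \right]^2}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2} \qquad (11.3.4) \]

\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - \hat{\beta}_0 \sum_{i=1}^{n} y_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i y_i \qquad (11.3.5) \]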
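
The equivalence of the computing formulas is easy to confirm numerically. The Python sketch below uses a small hypothetical data set (not from the text) and assumes that Equations 11.3.3 through 11.3.5 are the standard identities $S_{yy} - \hat{\beta}_1^2 S_{xx}$, $S_{yy} - S_{xy}^2 / S_{xx}$, and $\sum y_i^2 - \hat{\beta}_0 \sum y_i - \hat{\beta}_1 \sum x_i y_i$, respectively; each is compared with the sum of squared residuals computed directly.

```python
# Hypothetical data, chosen only to exercise the formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
x_bar, y_bar = sum_x / n, sum_y / n

# Slope and intercept estimates in the form given in Theorem 11.3.1.
beta1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
beta0 = y_bar - beta1 * x_bar

# Direct calculation: sum of squared residuals.
direct = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))

# Assumed form of Equation 11.3.3: S_yy - beta1^2 * S_xx.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
form_1133 = s_yy - beta1 ** 2 * s_xx

# Assumed form of Equation 11.3.4: raw sums only.
form_1134 = (sum_y2 - sum_y ** 2 / n) - (
    (sum_xy - sum_x * sum_y / n) ** 2 / (sum_x2 - sum_x ** 2 / n)
)

# Assumed form of Equation 11.3.5: raw sums plus the two estimates.
form_1135 = sum_y2 - beta0 * sum_y - beta1 * sum_xy

s_squared = direct / (n - 2)  # unbiased estimate of sigma^2
print(direct, form_1133, form_1134, form_1135, s_squared)
```

In floating-point arithmetic the four values agree only up to rounding error; formulas built from large raw sums, such as the third, can lose precision when the residuals are small relative to the data.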


Drawing Inferences about $\beta_1$


Hypothesis tests and confidence intervals for $\beta_1$ can be carried out by
defining a t statistic based on the properties that appear in Theorems 11.3.2
and 11.3.3.


Theorem 11.3.4. Let $(x_1, Y_1), (x_2, Y_2), \ldots,$ and $(x_n, Y_n)$ be a set
of points that satisfy the assumptions of the simple linear model, and let
$S^2 = \frac{1}{n - 2} \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$.
Then

\[ T_{n-2} = \frac{\hat{\beta}_1 - \beta_1}{S \Big/ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \]

has a Student t distribution with n - 2 degrees of freedom.


Proof We know from Theorem 11.3.2 that

\[ Z = \frac{\hat{\beta}_1 - \beta_1}{\sigma \Big/ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \]

has a standard normal pdf. Furthermore,
$\dfrac{n\hat{\sigma}^2}{\sigma^2} = \dfrac{(n - 2)S^2}{\sigma^2}$ has a
$\chi^2$ pdf with n - 2 degrees of freedom, and, by Theorem 11.3.3, Z and
$\dfrac{(n - 2)S^2}{\sigma^2}$ are independent. From Definition 7.3.3, then, it
follows that

\[ Z \Bigg/ \sqrt{\frac{(n - 2)S^2}{\sigma^2} \Big/ (n - 2)} = \frac{\hat{\beta}_1 - \beta_1}{S \Big/ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \]

has a Student t distribution with n - 2 degrees of freedom.


Theorem 11.3.5. Let $(x_1, Y_1), (x_2, Y_2), \ldots,$ and $(x_n, Y_n)$ be a set
of points that satisfy the assumptions of the simple linear model. Let

\[ t = \frac{\hat{\beta}_1 - \beta_{1_o}}{s \Big/ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \]

a. To test $H_0$: $\beta_1 = \beta_{1_o}$ versus $H_1$: $\beta_1 > \beta_{1_o}$
at the $\alpha$ level of significance, reject $H_0$ if $t \ge t_{\alpha, n-2}$.

b. To test $H_0$: $\beta_1 = \beta_{1_o}$ versus $H_1$: $\beta_1 < \beta_{1_o}$
at the $\alpha$ level of significance, reject $H_0$ if $t \le -t_{\alpha, n-2}$.

c. To test $H_0$: $\beta_1 = \beta_{1_o}$ versus $H_1$: $\beta_1 \ne \beta_{1_o}$
at the $\alpha$ level of significance, reject $H_0$ if t is either
(1) $\le -t_{\alpha/2, n-2}$ or (2) $\ge t_{\alpha/2, n-2}$.


Proof The decision rule given here is, in fact, a GLRT. A formal proof proceeds
along the lines followed in Appendix 7.A.4. We will omit the details.


Comment A particularly common application of Theorem 11.3.5 is to test
$H_0$: $\beta_1 = 0$. If the null hypothesis that the slope is zero is
rejected, it can be concluded (at the $\alpha$ level of significance) that
E(Y) changes with x. Conversely, if $H_0$: $\beta_1 = 0$ is not rejected, the
data have not ruled out the possibility that variation in Y is unaffected by x.


Case Study 11.3.1 


By late 1971, all cigarette packs had to be labeled with the words, “Warning: 
The Surgeon General Has Determined That Smoking Is Dangerous To Your 
Health.” The case against smoking rested heavily on statistical, rather than lab- 
oratory, evidence. Extensive surveys of smokers and nonsmokers had revealed 
the former to have much higher risks of dying from a variety of causes, including 
heart disease. 

Typical of that research are the data in Table 11.3.1, showing the annual cigarette consumption, $x$, and the corresponding mortality rate, $Y$, due to coronary heart disease (CHD) for twenty-one countries (116). Do these data support the suspicion that smoking contributes to CHD mortality? Test $H_0: \beta_1 = 0$ versus $H_1: \beta_1 > 0$ at the $\alpha = 0.05$ level of significance.


Table 11.3.1

                  Cigarette Consumption      CHD Mortality per 100,000
Country           per Adult per Year, x      (ages 35-64), y
United States     3900                       256.9
Canada            3350                       211.6
Australia         3220                       238.1
New Zealand       3220                       211.8
United Kingdom    2790                       194.1
Switzerland       2780                       124.5
Ireland           2770                       187.3
Iceland           2290                       110.5
Finland           2160                       233.1
West Germany      1890                       150.3
Netherlands       1810                       124.7
Greece            1800                        41.2
Austria           1770                       182.1
Belgium           1700                       118.1
Mexico            1680                        31.9
Italy             1510                       114.3
Denmark           1500                       144.9
France            1410                        59.7
Sweden            1270                       126.9
Spain             1200                        43.9
Norway            1090                       136.3

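From Table 11.3.1,

$$\sum_{i=1}^{21} x_i = 45{,}110 \qquad \sum_{i=1}^{21} y_i = 3{,}042.2$$

$$\sum_{i=1}^{21} x_i^2 = 109{,}957{,}100 \qquad \sum_{i=1}^{21} y_i^2 = 529{,}321.58$$

$$\sum_{i=1}^{21} x_i y_i = 7{,}319{,}602$$

and it follows that

$$\hat{\beta}_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n \left(\sum_{i=1}^{n} x_i^2\right) - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{21(7{,}319{,}602) - (45{,}110)(3{,}042.2)}{21(109{,}957{,}100) - (45{,}110)^2} = 0.0601$$

and

$$\hat{\beta}_0 = \frac{\sum_{i=1}^{n} y_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i}{n} = \frac{3{,}042.2 - 0.0601(45{,}110)}{21} = 15.771$$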


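The two other quantities needed for the test statistic are

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2 = 109{,}957{,}100 - \left(\frac{1}{21}\right)(45{,}110)^2 = 13{,}056{,}523.81$$

so $\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{13{,}056{,}523.81} = 3{,}613.38$.

From Equation 11.3.5,

$$s^2 = \frac{1}{21 - 2}\left[\sum_{i=1}^{21} y_i^2 - \hat{\beta}_0 \sum_{i=1}^{21} y_i - \hat{\beta}_1 \sum_{i=1}^{21} x_i y_i\right] = \frac{1}{19}\left[529{,}321.58 - (15.766)(3{,}042.2) - (0.0601)(7{,}319{,}602)\right] = 2{,}181.588$$

and $s = \sqrt{2{,}181.588} = 46.707$.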
To test

$$H_0: \beta_1 = 0$$

versus

$$H_1: \beta_1 > 0$$

at the $\alpha = 0.05$ level of significance, we should reject the null hypothesis if $t \ge t_{.05,19} = 1.7291$. But

$$t = \frac{\hat{\beta}_1 - \beta_{1_o}}{s \Big/ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} = \frac{0.0601 - 0}{46.707/3{,}613.38} = 4.65$$

so our conclusion is clear-cut: reject $H_0$. It would appear that the level of CHD mortality in a country is affected by its citizens’ smoking habits. More specifically, as the number of people who smoke increases, so will the number who die of coronary heart disease.
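
The arithmetic in Case Study 11.3.1 can be retraced with a short script. The following is a minimal sketch, not a definitive implementation: the function name `slope_t_test` and the variable names are illustrative assumptions, and the critical value 1.7291 is copied from the text rather than computed. Because the script carries unrounded values of the slope and intercept, its intermediate results can differ from the hand calculation in the last printed digit.

```python
import math

def slope_t_test(x, y, beta1_null=0.0):
    """Least squares fit plus the t ratio of Theorem 11.3.5 (illustrative helper)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    beta1_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    beta0_hat = (sum_y - beta1_hat * sum_x) / n
    # s^2 is the residual sum of squares divided by n - 2
    sse = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    sxx = sum_xx - sum_x ** 2 / n  # equals the sum of (x_i - xbar)^2
    t = (beta1_hat - beta1_null) / (s / math.sqrt(sxx))
    return beta1_hat, beta0_hat, s, t

# Table 11.3.1: cigarette consumption (x) and CHD mortality (y)
consumption = [3900, 3350, 3220, 3220, 2790, 2780, 2770, 2290, 2160, 1890, 1810,
               1800, 1770, 1700, 1680, 1510, 1500, 1410, 1270, 1200, 1090]
mortality = [256.9, 211.6, 238.1, 211.8, 194.1, 124.5, 187.3, 110.5, 233.1, 150.3,
             124.7, 41.2, 182.1, 118.1, 31.9, 114.3, 144.9, 59.7, 126.9, 43.9, 136.3]

T_CRITICAL = 1.7291  # t_{.05,19}, taken from the text

beta1_hat, beta0_hat, s, t = slope_t_test(consumption, mortality)
reject_h0 = t >= T_CRITICAL  # one-sided rule (a) of Theorem 11.3.5
print(f"slope={beta1_hat:.4f}  intercept={beta0_hat:.3f}  s={s:.3f}  t={t:.2f}  reject H0: {reject_h0}")
```

Passing a nonzero `beta1_null` gives the same statistic for a null value other than zero.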


Theorem 11.3.6

Let $(x_1, Y_1), (x_2, Y_2), \ldots,$ and $(x_n, Y_n)$ be a set of points that satisfy the assumptions of the simple linear model, and let $s^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$. Then

$$\left(\hat{\beta}_1 - t_{\alpha/2, n-2} \cdot \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}},\ \ \hat{\beta}_1 + t_{\alpha/2, n-2} \cdot \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}\right)$$

is a $100(1 - \alpha)\%$ confidence interval for $\beta_1$.


Proof Let $T_{n-2}$ denote a Student t random variable with $n - 2$ degrees of freedom, in which case

$$P(-t_{\alpha/2, n-2} \le T_{n-2} \le t_{\alpha/2, n-2}) = 1 - \alpha$$

Substitute the expression for $T_{n-2}$ given in Theorem 11.3.4 and isolate $\beta_1$ in the center of the inequalities. The resulting endpoints will be the expressions appearing in the statement of the theorem.


Case Study 11.3.2 


For many firms, the cost of sales is a linear function of net revenue. This seems 
to be the case for Starbucks, now a staple of the coffee-drinking public. Prior 
to 1971, Americans drinking coffee outside of their homes had little choice but 
a weak, watery brew often kept for hours on a hotplate, giving a burned, bitter 
taste. In 1971, a company opened a coffee shop in Seattle’s famous Pike Place 
Market to serve robust and fresh coffee. The shop was named after a character 
in Herman Melville’s Moby Dick, and it signified the import of coffee across the 
seas. By 2007, the chain had grown to over fifteen thousand outlets. 

Table 11.3.2 shows Starbucks’ annual net revenue (x) and the cost of oper- 
ating the stores (y) primarily responsible for generating that revenue. Graphed, 
the xy-relationship is described very well by the line $y = 18.57 + 0.41x$, where 18.57 and 0.41 are the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ calculated from the formulas in Theorem 11.3.1 (see Figure 11.3.3).


Table 11.3.2

Year    Net Revenue (in $ millions), x    Cost of Sales (in $ millions), y
1999    1687                               748
2000    2178                               962
2001    2649                              1113
2002    3289                              1350
2003    4076                              1686
2004    5294                              2199
2005    6369                              2605
2006    7787                              3179
2007    9411                              3999

Source: Company reports.


Figure 11.3.3 (cost of sales plotted against net revenue)

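The true slope in this situation, $\beta_1$, is particularly important from the company’s perspective because it represents the amount that costs are likely to increase when revenues go up by one unit. That said, it makes sense to construct, say, a 95% confidence interval for $\beta_1$ based on the observed $\hat{\beta}_1$.

Here,

$$\sum_{i=1}^{9} (x_i - \bar{x})^2 = 56{,}865{,}526.89$$

$$\sqrt{\sum_{i=1}^{9} (x_i - \bar{x})^2} = \sqrt{56{,}865{,}526.89} = 7540.92$$

and from Equation 11.3.5,

$$s^2 = \frac{1}{9 - 2}\left[\sum_{i=1}^{9} y_i^2 - \hat{\beta}_0 \sum_{i=1}^{9} y_i - \hat{\beta}_1 \sum_{i=1}^{9} x_i y_i\right] = \frac{1}{7}\left[45{,}108{,}481 - (18.57)(17{,}841) - (0.41)(108{,}239{,}948)\right] = 2535.01$$

so $s = \sqrt{2535.01} = 50.35$. (The value 2535.01 is obtained by carrying the unrounded values of $\hat{\beta}_0$ and $\hat{\beta}_1$ through the calculation; 18.57 and 0.41 are those estimates rounded to two decimal places.)

Using $t_{.025,7} = 2.3646$, the expression given in Theorem 11.3.6 reduces to

$$\left(0.41 - 2.3646 \cdot \frac{50.35}{7540.92},\ \ 0.41 + 2.3646 \cdot \frac{50.35}{7540.92}\right) = (\$0.394, \$0.426)$$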


Judging from these data, then, the company can anticipate that costs will 
rise somewhere between thirty-nine and forty-three cents for every one-dollar 
increase in revenues. 
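
The interval of Theorem 11.3.6 is straightforward to compute from the raw data in Table 11.3.2. The sketch below is one way to do it, under stated assumptions: the function name `slope_confidence_interval` is illustrative, and the critical value 2.3646 is taken from the text. Since the script uses the unrounded slope rather than 0.41, its endpoints sit a few thousandths above the ones printed in the case study.

```python
import math

def slope_confidence_interval(x, y, t_crit):
    """Return (beta1_hat, s, (lower, upper)) following Theorem 11.3.6."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    beta1_hat = sxy / sxx
    beta0_hat = ybar - beta1_hat * xbar
    sse = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    half_width = t_crit * s / math.sqrt(sxx)
    return beta1_hat, s, (beta1_hat - half_width, beta1_hat + half_width)

# Table 11.3.2: net revenue (x) and cost of sales (y), in $ millions
revenue = [1687, 2178, 2649, 3289, 4076, 5294, 6369, 7787, 9411]
cost = [748, 962, 1113, 1350, 1686, 2199, 2605, 3179, 3999]

T_025_7 = 2.3646  # t_{.025,7}, taken from the text

slope, s, (lower, upper) = slope_confidence_interval(revenue, cost, T_025_7)
print(f"slope={slope:.4f}  s={s:.2f}  95% CI=({lower:.3f}, {upper:.3f})")
```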


About the Data The predictive value of the regression equation in Case 
Study 11.3.2 depends on a continuing healthy economic climate after the years 1999- 
2007, the period for which the data were generated. In the case of Starbucks, a 
serious economic downturn began in 2008, and in the summer of that year, Starbucks 
announced plans to close six hundred stores. An equation based on 1999-2007 data
might still be useful, but a more prudent strategy would be to revisit the equation in 
light of what happened in 2008 and 2009, when consumers’ discretionary expenses 
were curtailed. 


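Drawing Inferences about $\beta_0$


In practice, the value of $\beta_0$ is not likely to be as important as the value of $\beta_1$. Slopes often quantify particularly important aspects of xy-relationships, which was true, for example, in Case Study 11.3.2. Nevertheless, hypothesis tests and confidence intervals for $\beta_0$ can be easily derived from the results given in Theorems 11.3.2 and 11.3.3.

The GLRT procedure for assessing the credibility of $H_0: \beta_0 = \beta_{0_o}$ is based on a Student t random variable with $n - 2$ degrees of freedom:

$$T_{n-2} = \frac{(\hat{\beta}_0 - \beta_{0_o})\,\sqrt{n}\,\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}{S \sqrt{\sum_{i=1}^{n} x_i^2}} = \frac{\hat{\beta}_0 - \beta_{0_o}}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_0)}} \qquad (11.3.6)$$

“Inverting” Equation 11.3.6 (recall the proof of Theorem 11.3.6) yields

$$\left(\hat{\beta}_0 - t_{\alpha/2, n-2} \cdot \frac{s \sqrt{\sum_{i=1}^{n} x_i^2}}{\sqrt{n}\,\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}},\ \ \hat{\beta}_0 + t_{\alpha/2, n-2} \cdot \frac{s \sqrt{\sum_{i=1}^{n} x_i^2}}{\sqrt{n}\,\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}\right)$$

as the formula for a $100(1 - \alpha)\%$ confidence interval for $\beta_0$.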
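
As an illustration of the intercept interval, the sketch below applies the formula to the Starbucks data of Table 11.3.2. The function name `intercept_confidence_interval` is an assumption made for this example, and the critical value $t_{.025,7} = 2.3646$ is the one quoted in Case Study 11.3.2.

```python
import math

def intercept_confidence_interval(x, y, t_crit):
    """Return (beta0_hat, se_beta0, (lower, upper)) for the y-intercept."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta1_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    beta0_hat = ybar - beta1_hat * xbar
    sse = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    # estimated standard deviation of beta0_hat: s*sqrt(sum x_i^2) / (sqrt(n)*sqrt(Sxx))
    se_beta0 = s * math.sqrt(sum(xi * xi for xi in x)) / (math.sqrt(n) * math.sqrt(sxx))
    return beta0_hat, se_beta0, (beta0_hat - t_crit * se_beta0, beta0_hat + t_crit * se_beta0)

# Table 11.3.2: net revenue (x) and cost of sales (y), in $ millions
revenue = [1687, 2178, 2649, 3289, 4076, 5294, 6369, 7787, 9411]
cost = [748, 962, 1113, 1350, 1686, 2199, 2605, 3179, 3999]

b0, se_b0, (b0_lower, b0_upper) = intercept_confidence_interval(revenue, cost, 2.3646)
print(f"intercept={b0:.2f}  se={se_b0:.2f}  95% CI=({b0_lower:.1f}, {b0_upper:.1f})")
```

For these data the interval is wide relative to the estimate and contains zero, which fits the remark that the intercept is usually of less interest than the slope.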


Drawing Inferences about $\sigma^2$


Since $(n - 2)S^2/\sigma^2$ has a $\chi^2$ pdf with $n - 2$ df (if the $n$ observations satisfy the stipulations implicit in the simple linear model), it follows that

$$P\left[\chi^2_{\alpha/2, n-2} \le \frac{(n-2)S^2}{\sigma^2} \le \chi^2_{1-\alpha/2, n-2}\right] = 1 - \alpha$$

Equivalently,

$$P\left[\frac{(n-2)S^2}{\chi^2_{1-\alpha/2, n-2}} \le \sigma^2 \le \frac{(n-2)S^2}{\chi^2_{\alpha/2, n-2}}\right] = 1 - \alpha$$

in which case

$$\left(\frac{(n-2)s^2}{\chi^2_{1-\alpha/2, n-2}},\ \ \frac{(n-2)s^2}{\chi^2_{\alpha/2, n-2}}\right)$$

becomes the $100(1 - \alpha)\%$ confidence interval for $\sigma^2$ (recall Theorem 7.5.1). Testing $H_0: \sigma^2 = \sigma_o^2$ is done by calculating the ratio

$$\chi^2 = \frac{(n-2)s^2}{\sigma_o^2}$$

which has a $\chi^2$ distribution with $n - 2$ df when the null hypothesis is true. Except for the degrees of freedom ($n - 2$ rather than $n - 1$), the appropriate decision rules for one-sided and two-sided $H_1$’s are similar to those given in Theorem 7.5.2.
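
The confidence interval for $\sigma^2$ needs only $s^2$, $n$, and two chi square percentiles. The following sketch assumes the percentiles are read from a chi square table and passed in by the caller; the function name and the constants are illustrative. The values 1.690 and 16.013 are the standard 2.5th and 97.5th percentiles for 7 degrees of freedom, and $s^2 = 2535.01$ with $n = 9$ comes from Case Study 11.3.2.

```python
def variance_confidence_interval(s_squared, n, chi2_lower, chi2_upper):
    """Confidence interval for sigma^2 in the simple linear model.

    chi2_lower and chi2_upper are the alpha/2 and 1 - alpha/2 percentiles of the
    chi square distribution with n - 2 degrees of freedom (supplied from a table).
    """
    df = n - 2
    return df * s_squared / chi2_upper, df * s_squared / chi2_lower

CHI2_025_7 = 1.690    # 2.5th percentile, 7 df (table value)
CHI2_975_7 = 16.013   # 97.5th percentile, 7 df (table value)

var_lower, var_upper = variance_confidence_interval(2535.01, 9, CHI2_025_7, CHI2_975_7)
print(f"95% CI for sigma^2: ({var_lower:.1f}, {var_upper:.1f})")
```

Note how asymmetric the interval is about $s^2$, a consequence of the skewness of the chi square distribution when the degrees of freedom are small.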


Questions 


11.3.1. Insect flight ability can be measured in a labora- 
tory by attaching the insect to a nearly frictionless rotating 
arm with a thin wire. The “tethered” insect then flies in 
circles until exhausted. The nonstop distance flown can 
easily be calculated from the number of revolutions made 
by the arm. The following are measurements of this sort 
made on Culex tarsalis mosquitos of four different ages. 
The response variable is the average distance flown until 
exhaustion for forty females of the species (150). 


Age, x (weeks)    Distance Flown, y (thousand meters)
1                 12.6
2                 11.6
3                  6.8
4                  9.2


Fit a straight line to these data and test that the slope 
is zero. Use a two-sided alternative and the 0.05 level of 
significance. 


11.3.2. The best straight line through the Massachusetts 
funding/graduation rate data described in Question 11.2.7 
has the equation y = 81.088 + 0.412x, where s = 11.78848. 


(a) Construct a 95% confidence interval for $\beta_1$.

(b) What does your answer to part (a) imply about the outcome of testing $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$ at the $\alpha = 0.05$ level of significance?

(c) Graph the data and superimpose the regression line. 
How would you summarize these data, and their 
implications, to a meeting of the state School Board? 


11.3.3. Based on the data in Question 11.2.1, the rela- 
tionship between y, the ambient temperature, and x, the 
frequency of a cricket’s chirping, is given by $y = 25.2 + 3.29x$, where $s = 3.83$. At the $\alpha = 0.01$ level of significance,
can the hypothesis that chirping frequency is not related 
to temperature be rejected? 


11.3.4. Suppose an experimenter intends to do a regression analysis by taking a total of $2n$ data points, where the $x_i$’s are restricted to the interval [0, 5]. If the xy-relationship is assumed to be linear and if the objective is to estimate the slope with the greatest possible precision, what values should be assigned to the $x_i$’s?


11.3.5. Suppose a total of $n = 9$ measurements are to be taken on a simple linear model, where the $x_i$’s will be set equal to 1, 2, ..., and 9. If the variance associated with the xy-relationship is known to be 45.0, what is the probability that the estimated slope will be within 1.5 units of the true slope?


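11.3.6. Prove the useful computing formula (Equation 11.3.5) that

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = \sum_{i=1}^{n} y_i^2 - \hat{\beta}_0 \sum_{i=1}^{n} y_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i y_i$$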


11.3.7. The sodium nitrate (NaNO$_3$) solubility data in Question 11.2.3 is described nicely by the regression line $y = 67.508 + 0.871x$, where $s = 0.959$. Construct a 90% confidence interval for the y-intercept, $\beta_0$.


11.3.8. Set up and carry out an appropriate hypothesis test for the Hanford radioactive contamination data given in Question 11.2.9. Let $\alpha = 0.05$. Justify your choice of $H_0$ and $H_1$. What do you conclude?


11.3.9. Test $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$ for the plumage index/behavioral index data given in Question 11.2.11. Let $\alpha = 0.05$. Use the fact that $y = 0.61 + 0.84x$ is the best straight line describing the xy-relationship.


11.3.10. Let $(x_1, Y_1), (x_2, Y_2), \ldots,$ and $(x_n, Y_n)$ be a set of points satisfying the assumptions of the simple linear model. Prove that

$$E(\bar{Y}) = \beta_0 + \beta_1 \bar{x}$$


11.3.11. Derive a formula for a 95% confidence interval for $\beta_0$ if $n$ $(x_i, Y_i)$’s are taken on a simple linear model where $\sigma$ is known.


11.3.12. Which, if any, of the assumptions of the simple 
linear model appear to be violated in the following scat- 
terplot? Which, if any, appear to be satisfied? Which, if 
any, cannot be assessed by looking at the scatterplot? 


[Scatterplot of $y$ versus $x$ with the fitted line $y = \hat{\beta}_0 + \hat{\beta}_1 x$]


11.3.13. State the decision rule and the conclusion if $H_0: \sigma^2 = 12.6$ is to be tested against $H_1: \sigma^2 \ne 12.6$ where $n = 24$, $s^2 = 18.2$, and $\alpha = 0.05$.


11.3.14. Construct a 90% confidence interval for $\sigma^2$ in the cigarette-consumption/CHD mortality data given in Case Study 11.3.1.


11.3.15. Recall Kepler’s Third Law data given in Question 8.2.1. The estimated regression line describing the xy-relationship has the equation $y = 2.27 + 0.16x$, where $s = 2.31$. Construct a 90% confidence interval for $\sigma^2$.


Drawing Inferences about $E(Y \mid x)$


In Case Study 11.3.1, the random variable $Y$ represents the CHD mortality resulting from $x$ cigarette consumption. A public health official would certainly want to have some idea of the range of mortality likely to be encountered in a country where $x$ is, say, 4200.

Intuition tells us that a reasonable point estimator for $E(Y \mid x)$ is the height of the regression line at $x$, that is, $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$. By Theorem 11.3.2, the latter is unbiased:

$$E(\hat{Y}) = E(\hat{\beta}_0 + \hat{\beta}_1 x) = E(\hat{\beta}_0) + x E(\hat{\beta}_1) = \beta_0 + \beta_1 x$$

Of course, to use $\hat{Y}$ in any inference procedure requires that we know its variance. But

$$\begin{aligned}
\operatorname{Var}(\hat{Y}) &= \operatorname{Var}(\hat{\beta}_0 + \hat{\beta}_1 x) = \operatorname{Var}(\bar{Y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x) \\
&= \operatorname{Var}[\bar{Y} + \hat{\beta}_1 (x - \bar{x})] \\
&= \operatorname{Var}(\bar{Y}) + (x - \bar{x})^2 \operatorname{Var}(\hat{\beta}_1) \quad \text{(why?)} \\
&= \frac{1}{n}\sigma^2 + \frac{(x - \bar{x})^2 \sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
&= \sigma^2 \left[\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]
\end{aligned}$$

An application of Definition 7.3.3, then, allows us to construct a Student t random variable based on $\hat{Y}$. Specifically,

$$T_{n-2} = \frac{\hat{Y} - (\beta_0 + \beta_1 x)}{\sigma \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}} \Bigg/ \sqrt{\frac{(n-2)S^2}{\sigma^2 (n-2)}} = \frac{\hat{Y} - (\beta_0 + \beta_1 x)}{S \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$

has a Student t distribution with $n - 2$ degrees of freedom. Isolating $\beta_0 + \beta_1 x = E(Y \mid x)$ in the center of the inequalities $P(-t_{\alpha/2, n-2} \le T_{n-2} \le t_{\alpha/2, n-2}) = 1 - \alpha$ produces a $100(1 - \alpha)\%$ confidence interval for $E(Y \mid x)$.


Theorem 11.3.7

Let $(x_1, Y_1), (x_2, Y_2), \ldots,$ and $(x_n, Y_n)$ be a set of points that satisfy the assumptions of the simple linear model. A $100(1 - \alpha)\%$ confidence interval for $E(Y \mid x) = \beta_0 + \beta_1 x$ is given by $(\hat{y} - w, \hat{y} + w)$, where

$$w = t_{\alpha/2, n-2} \cdot s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

and $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.


Example 11.3.2

Look again at Case Study 11.3.1. Suppose a country’s public health officials estimate cigarette consumption to be 4200 per adult per year. If that were the case, what CHD mortality would they expect? Answer the question by constructing a 95% confidence interval for $E(Y \mid 4200)$.

Here, $n = 21$, $t_{.025,19} = 2.0930$, $\sum_{i=1}^{21} (x_i - \bar{x})^2 = 13{,}056{,}523.81$, $s = 46.707$, $\hat{\beta}_0 = 15.7661$, $\hat{\beta}_1 = 0.0601$, and $\bar{x} = 2148.095$. From Theorem 11.3.7, then,

$$\hat{y} = 15.7661 + 0.0601(4200) = 268.1861$$

$$w = 2.0930(46.707)\sqrt{\frac{1}{21} + \frac{(4200 - 2148.095)^2}{13{,}056{,}523.81}} = 59.4714$$

and the 95% confidence interval for $E(Y \mid 4200)$ is

$$(268.1861 - 59.4714,\ 268.1861 + 59.4714)$$

which rounded to two decimal places is

$$(208.71, 327.66)$$
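
The calculation in Example 11.3.2 can be packaged as a small function of the summary statistics. This is a sketch under stated assumptions: the function name `mean_response_interval` and its argument names are illustrative, and every numerical input below is a value quoted in the example.

```python
import math

def mean_response_interval(x0, n, xbar, sxx, s, beta0_hat, beta1_hat, t_crit):
    """Return (y_hat, w) so that (y_hat - w, y_hat + w) estimates E(Y | x0)."""
    y_hat = beta0_hat + beta1_hat * x0
    w = t_crit * s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
    return y_hat, w

# Summary statistics quoted in Example 11.3.2
stats = dict(n=21, xbar=2148.095, sxx=13056523.81, s=46.707,
             beta0_hat=15.7661, beta1_hat=0.0601, t_crit=2.0930)

y_hat, w = mean_response_interval(4200, **stats)
print(f"95% CI for E(Y | 4200): ({y_hat - w:.2f}, {y_hat + w:.2f})")

# The half-width shrinks as x0 approaches xbar
_, w_center = mean_response_interval(stats["xbar"], **stats)
print(f"half-width at x = xbar: {w_center:.2f}")
```

Evaluating the function over a grid of `x0` values traces out the confidence band for the regression line.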


Comment Notice from the formula in Theorem 11.3.7 that the width of a confidence interval for $E(Y \mid x)$ increases as the value of $x$ becomes more extreme. That is, we are better able to predict the location of the regression line for an x-value close to $\bar{x}$ than we are for x-values that are either very small or very large.

Figure 11.3.4 shows the dependence of $w$ on $x$ for the data from Case Study 11.3.1. The lower and upper limits for the 95% confidence interval for $E(Y \mid x)$ have been calculated for all $x$. Pictured is the dotted curve (or 95% confidence band) connecting those endpoints. The width of the band is smallest when $x = 2148.1\ (= \bar{x})$.


Drawing Inferences about Future Observations


A variation on Theorem 11.3.7 is the determination of a range of numbers that would have a high probability of including the value $Y$ of a single future observation to be recorded at some given level of $x$. In Case Study 11.3.1, public health officials might want to predict the actual (not the average) CHD mortality that would occur if cigarette consumption is $x$.

Let $(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)$ be a set of $n$ points that satisfy the assumptions of the simple linear model, and let $(x, Y)$ be a hypothetical future observation, where $Y$ is independent of the $n$ $Y_i$’s. A prediction interval is a range of numbers that contains $Y$ with a specified probability.

Consider the difference $\hat{Y} - Y$. Clearly,

$$E(\hat{Y} - Y) = E(\hat{Y}) - E(Y) = (\beta_0 + \beta_1 x) - (\beta_0 + \beta_1 x) = 0$$

and

$$\begin{aligned}
\operatorname{Var}(\hat{Y} - Y) &= \operatorname{Var}(\hat{Y}) + \operatorname{Var}(Y) \\
&= \sigma^2 \left[\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right] + \sigma^2 \\
&= \sigma^2 \left[1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]
\end{aligned}$$

Following exactly the same steps that were taken in the derivation of Theorem 11.3.7, a Student t random variable with $n - 2$ degrees of freedom can be constructed from $\hat{Y} - Y$ (using Definition 7.3.3). Inverting the equation $P(-t_{\alpha/2, n-2} \le T_{n-2} \le t_{\alpha/2, n-2}) = 1 - \alpha$ will then yield the prediction interval $(\hat{y} - w, \hat{y} + w)$ given in Theorem 11.3.8.


Theorem 11.3.8

Let $(x_1, Y_1), (x_2, Y_2), \ldots,$ and $(x_n, Y_n)$ be a set of $n$ points that satisfy the assumptions of the simple linear model. A $100(1 - \alpha)\%$ prediction interval for $Y$ at the fixed value $x$ is given by $(\hat{y} - w, \hat{y} + w)$, where

$$w = t_{\alpha/2, n-2} \cdot s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

and $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.


Based on the data in Case Study 11.3.1, we calculated in Example 11.3.2 that a 95% confidence interval for $E(Y \mid 4200)$ is (208.71, 327.66). How does that compare to the corresponding 95% prediction interval for $Y$?

When $x = 4200$, $\hat{y} = 268.1861$ for both intervals. From Theorem 11.3.8, the width of the 95% prediction interval for $Y$ is:

$$w = 2.0930(46.707)\sqrt{1 + \frac{1}{21} + \frac{(4200 - 2148.095)^2}{13{,}056{,}523.81}} = 114.4262$$

The 95% prediction interval, then, is

$$(268.1861 - 114.4262,\ 268.1861 + 114.4262)$$

which rounded to two decimal places is

$$(153.76, 382.61)$$

which makes it 92% wider than the 95% confidence interval for $E(Y \mid 4200)$.
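
The comparison between the two intervals can be checked numerically. The sketch below evaluates the half-width of Theorem 11.3.8 next to that of Theorem 11.3.7, using the summary statistics quoted in Example 11.3.2; the function and variable names are illustrative assumptions.

```python
import math

def interval_half_widths(x0, n, xbar, sxx, s, t_crit):
    """Half-widths of the confidence interval for E(Y | x0) and the prediction interval for Y."""
    leverage = 1 / n + (x0 - xbar) ** 2 / sxx
    w_mean = t_crit * s * math.sqrt(leverage)            # Theorem 11.3.7
    w_prediction = t_crit * s * math.sqrt(1 + leverage)  # Theorem 11.3.8
    return w_mean, w_prediction

y_hat_4200 = 15.7661 + 0.0601 * 4200  # point estimate used for both intervals
w_mean, w_pred = interval_half_widths(4200, n=21, xbar=2148.095,
                                      sxx=13056523.81, s=46.707, t_crit=2.0930)
pred_lower, pred_upper = y_hat_4200 - w_pred, y_hat_4200 + w_pred
percent_wider = 100 * (w_pred / w_mean - 1)
print(f"prediction interval: ({pred_lower:.2f}, {pred_upper:.2f}), "
      f"{percent_wider:.0f}% wider than the confidence interval")
```

The extra 1 under the square root reflects the variance of the single future observation itself, which does not shrink as $n$ grows.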


Testing the Equality of Two Slopes 


We saw in Chapter 9 that the comparison of two treatments or two conditions often leads to a hypothesis test that the mean of one is equal to the mean of the other. Similarly, the comparison of two linear xy-relationships often requires that we test $H_0: \beta_1 = \beta_1^*$, where $\beta_1$ and $\beta_1^*$ are the true slopes associated with the two regressions.

If the data points taken on the two regressions are all independent, a two-sample t test can be set up based on the properties in Theorems 11.3.2 and 11.3.3. Theorem 11.3.9 identifies the appropriate test statistic and summarizes the GLRT decision rule. Details of the proof will be omitted.


Theorem 11.3.9

Let $(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)$ and $(x_1^*, Y_1^*), (x_2^*, Y_2^*), \ldots, (x_m^*, Y_m^*)$ be two independent sets of points, each satisfying the assumptions of the simple linear model, that is, $E(Y \mid x) = \beta_0 + \beta_1 x$ and $E(Y^* \mid x^*) = \beta_0^* + \beta_1^* x^*$.

a. Let

$$T = \frac{\hat{\beta}_1 - \hat{\beta}_1^* - (\beta_1 - \beta_1^*)}{S \sqrt{\dfrac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} + \dfrac{1}{\sum_{i=1}^{m} (x_i^* - \bar{x}^*)^2}}}$$

where

$$S = \sqrt{\frac{\sum_{i=1}^{n} [Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2 + \sum_{i=1}^{m} [Y_i^* - (\hat{\beta}_0^* + \hat{\beta}_1^* x_i^*)]^2}{n + m - 4}}$$

Then $T$ has a Student t distribution with $n + m - 4$ degrees of freedom.

b. To test $H_0: \beta_1 = \beta_1^*$ versus $H_1: \beta_1 \ne \beta_1^*$ at the $\alpha$ level of significance, reject $H_0$ if $t$ is either (1) $\le -t_{\alpha/2, n+m-4}$ or (2) $\ge t_{\alpha/2, n+m-4}$, where

$$t = \frac{\hat{\beta}_1 - \hat{\beta}_1^*}{s \sqrt{\dfrac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} + \dfrac{1}{\sum_{i=1}^{m} (x_i^* - \bar{x}^*)^2}}}$$

(One-sided tests are defined in the usual way by replacing $\pm t_{\alpha/2, n+m-4}$ with either $t_{\alpha, n+m-4}$ or $-t_{\alpha, n+m-4}$.)


Genetic variability is thought to be a key factor in the survival of a species, the idea being that “diverse” populations should have a better chance of coping with changing environments. Table 11.3.3 summarizes the results of a study designed to test that hypothesis experimentally [data slightly modified from (4)]. Two populations of fruit flies (Drosophila serrata), one that was cross-bred (Strain A) and the other, in-bred (Strain B), were put into sealed containers where food and space were kept to a minimum. Recorded every hundred days were the numbers of Drosophila alive in each population.

Table 11.3.3

Date      Day no., x (= x*)    Strain A population, y    Strain B population, y*
Feb 2       0                  100                       100
May 13    100                  250                       203
Aug 21    200                  304                       214
Nov 29    300                  403                       295
Mar 8     400                  446                       330
Jun 16    500                  482                       324

Figure 11.3.5 shows a graph of the two sets of population figures. For both strains, growth was approximately linear over the period covered. Strain A, though, with an estimated slope of 0.74, increased at a faster rate than did Strain B, where the estimated slope was 0.45. The question is, do we have enough evidence here to reject the null hypothesis that the two true slopes are equal? Is the difference between 0.74 and 0.45, in other words, statistically significant?

Figure 11.3.5 (population size versus day number for Strain A and Strain B, with the fitted lines $y = 145.3 + 0.742x$ and $y^* = 131.3 + 0.452x$)


Let $\alpha = 0.05$ and let $(x_i, y_i)$, $i = 1, 2, \ldots, 6$, and $(x_i^*, y_i^*)$, $i = 1, 2, \ldots, 6$, denote the times and population sizes for Strain A and Strain B, respectively. Our objective is to test $H_0: \beta_1 = \beta_1^*$ versus $H_1: \beta_1 > \beta_1^*$. Rejecting $H_0$, of course, would support the contention that genetic variability benefits a species’ chances of survival.

From Table 11.3.3, $\bar{x} = \bar{x}^* = 250$ and

$$\sum_{i=1}^{6} (x_i - \bar{x})^2 = \sum_{i=1}^{6} (x_i^* - \bar{x}^*)^2 = 175{,}000$$

Also,

$$\sum_{i=1}^{6} [y_i - (145.3 + 0.742x_i)]^2 = 5512.14$$

and

$$\sum_{i=1}^{6} [y_i^* - (131.3 + 0.452x_i^*)]^2 = 3960.14$$

so

$$s = \sqrt{\frac{5512.14 + 3960.14}{6 + 6 - 4}} = 34.41$$

Since $H_1$ is one-sided to the right, we should reject $H_0$ if $t \ge t_{.05,8} = 1.8595$. But

$$t = \frac{0.742 - 0.452}{34.41\sqrt{\dfrac{1}{175{,}000} + \dfrac{1}{175{,}000}}} = 2.50$$

These data, then, do support the theory that genetically mixed populations have a better chance of survival in hostile environments.
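
The two-slope test of Theorem 11.3.9 can be sketched in a few lines, using the data in Table 11.3.3. The helper names `fit_line` and `two_slope_t` are assumptions made for this illustration, and the critical value 1.8595 is taken from the text. Because the script does not round $s$ or the residual sums of squares along the way, its t ratio can differ from the hand-computed value in the second decimal place.

```python
import math

def fit_line(x, y):
    """Return (beta0_hat, beta1_hat, sxx, sse) for one simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta1_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    beta0_hat = ybar - beta1_hat * xbar
    sse = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
    return beta0_hat, beta1_hat, sxx, sse

def two_slope_t(x, y, x_star, y_star):
    """t ratio for H0: beta1 = beta1*, with pooled s on n + m - 4 degrees of freedom."""
    _, b1, sxx, sse = fit_line(x, y)
    _, b1_star, sxx_star, sse_star = fit_line(x_star, y_star)
    df = len(x) + len(x_star) - 4
    s_pooled = math.sqrt((sse + sse_star) / df)
    t = (b1 - b1_star) / (s_pooled * math.sqrt(1 / sxx + 1 / sxx_star))
    return b1, b1_star, s_pooled, t, df

# Table 11.3.3
days = [0, 100, 200, 300, 400, 500]
strain_a = [100, 250, 304, 403, 446, 482]
strain_b = [100, 203, 214, 295, 330, 324]

slope_a, slope_b, s_pooled, t_ratio, df = two_slope_t(days, strain_a, days, strain_b)
reject_equal_slopes = t_ratio >= 1.8595  # t_{.05,8}, taken from the text
print(f"slopes: {slope_a:.3f} vs {slope_b:.3f}  s={s_pooled:.2f}  "
      f"t={t_ratio:.2f}  df={df}  reject H0: {reject_equal_slopes}")
```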


Questions 


11.3.16. Regression techniques can be very useful in situ- 
ations where one variable—say, y—is difficult to measure 
but x is not. Once such an xy-relationship has been “calibrated,” based on a set of $(x_i, y_i)$’s, future values of $Y$
can be easily estimated using $\hat{\beta}_0 + \hat{\beta}_1 x$. Determining the
volume of an irregularly shaped object, for example, is 
often difficult, but weighing that object is likely to be easy. 
The following table shows the weights (in kilograms) and 
the volumes (in cubic decimeters) of eighteen children 
between the ages of five and eight (13). The estimated 
regression line has the equation y = —0.104 + 0.988x, 
where s = 0.202. 


(a) Construct a 95% confidence interval for E(Y|14.0). 
(b) Construct a 95% prediction interval for the volume 
of a child weighing 14.0 kilograms. 


Weight, x    Volume, y    Weight, x    Volume, y
17.1         16.7         15.8         15.2
10.5         10.4         15.1         14.8
13.8         13.5         12.1         11.9
15.7         15.7         18.4         18.3
11.9         11.6         17.1         16.7
10.4         10.2         16.7         16.6
15.0         14.5         16.5         15.9
16.0         15.8         15.1         15.1
17.8         17.6         15.1         14.5


11.3.17. Construct a 95% confidence interval for $E(Y \mid 2.750)$ using the connecting rod data given in Case Study 11.2.1.


11.3.18. For the CHD mortality data of Case Study 11.3.1, 
construct a 99% confidence interval for the expected 
death rate in a country where the cigarette consump- 
tion is 2500 per adult per year. Is a public health official 
more likely to be interested in a 99% confidence interval 
for E(Y |2500) or a 99% prediction interval for Y when 
x = 2500? 


11.3.19. The Master of Business Administration (M.B.A.) 
degree typically prepares its possessors for a high-salaried 
position, most often in business or industry. So, a reason- 
able measure of the effectiveness of an M.B.A. program 
is the median salary of its graduates five years after grad- 
uation. The table gives the tuition paid and the median 
five-year-out salary for graduates of sixteen highly ranked 
private M.B.A. programs. 


                   Tuition          Median Salary
University         ($ thousands)    ($ thousands)
Wake Forest        71               108
Emory              81               121
SMU                81               122
Georgetown         83               147
USC                86               155
Vanderbilt         87               128
NYU                89               170
Cornell            92               168
Yale               93               160
Duke               93               148
Dartmouth          94               205
Northwestern       96               165
MIT                96               190
Chicago            97               210
Carnegie Mellon    98               145
Columbia           99               182


Source: www.forbes.com/lists/2009/95/best-business-schools-09_Best-Business- 
Schools. 


Find the 95% confidence interval for E(Y|102). Harvard’s 
tuition during this time period was $102,000. Does the 
interval include the Harvard graduate median salary of 
$215,000? 


11.3.20. In the radioactive exposure example in Ques- 
tion 11.2.9, find the 95% confidence interval for E(Y|9.00) 
and the prediction interval for the value 9.00. 


11.3.21. Attorneys representing a group of male buyers 
employed by Flirty Fashions are filing a reverse discrim- 
ination suit against the female-owned company. Central 
to their case are the following data, showing the relation- 
ship between years of service and annual salary for the 
firm’s fourteen buyers, six of whom are men. The plain- 
tiffs claim that the difference in slopes (0.606 for men 
versus 1.07 for women) is prima facie evidence that the 
company’s salary policies discriminate against men. As the 
lawyer for Flirty Fashions, how would you respond? Use 
the following sums: 


$$\sum_{i=1}^{6} (y_i - 21.3 - 0.606x_i)^2 = 5.983$$

and

$$\sum_{i=1}^{8} (y_i^* - 23.2 - 1.07x_i^*)^2 = 13.804$$

Also, $\sum_{i=1}^{6} (x_i - \bar{x})^2 = 31.33$ and $\sum_{i=1}^{8} (x_i^* - \bar{x}^*)^2 = 46$.

[Scatterplot: salary (in $ thousands) versus years of service]


11.3.22. Polls taken during a city’s last two administra- 
tions (one Democratic, one Republican) suggested that 
public support of the two mayors fell off linearly with 
years in office. Can it be concluded from the following 
data that the rates at which the two administrations lost 
favor were significantly different? Let $\alpha = 0.05$. (Note: $y = 69.3077 - 3.4615x$ with an estimated standard deviation of 0.9058 and $y^* = 59.9407 - 2.7373x^*$ with an estimated standard deviation of 1.2368.)


Democratic Mayor                     Republican Mayor
Years after         Percent in       Years after          Percent in
Taking Office, x    Support, y       Taking Office, x*    Support, y*
2                   63               1                    58
3                   58               2                    55
5                   52               4                    47
7                   46               6                    43
8                   41               7                    41
                                     8                    39


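11.3.23. Prove that the variance of $\hat{Y}$ can also be written

$$\operatorname{Var}(\hat{Y}) = \frac{\sigma^2 \sum_{i=1}^{n} (x_i - x)^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}$$


11.3.24. Show that

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

for any set of points $(x_i, Y_i)$, $i = 1, 2, \ldots, n$.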


11.4 Covariance and Correlation 


Our discussion of xy-relationships in Chapter 11 began with the simplest possible setup from a statistical standpoint: the case where the $(x_i, y_i)$’s are just numbers and have no probabilistic structure whatsoever. Then we examined the more complicated (and more “inference-friendly”) scenario where $x_i$ is a constant but $Y_i$ is a random variable. Introduced in this section is the next level of complexity: problems where both $X_i$ and $Y_i$ are assumed to be random variables. [Measurements of the form $(x_i, y_i)$ or $(x_i, Y_i)$ are typically referred to as regression data; observations satisfying the assumptions made in this section, that is, measurements of the form $(X_i, Y_i)$, are more commonly referred to as correlation data.]


Measuring the Dependence Between Two Random Variables 


Given a pair of random variables, it makes sense to inquire how one varies with 
respect to the other. If X increases, for example, does Y also tend to increase? And if 
so, how strong is the dependence between the two? 

The first step in addressing such questions was taken in Section 3.9 with the 
definition of covariance. In that section, its role was primarily as a tool for finding the 
variance of a sum of random variables. Here, it will serve as the basis for measuring 
the relationship between X and Y. 


The Correlation Coefficient 


The covariance of $X$ and $Y$ necessarily reflects the units of both random variables, which can make it difficult to interpret. In applied settings, it helps to have a dimensionless measure of dependency so that one xy-relationship can be compared to another. Dividing $\operatorname{Cov}(X, Y)$ by $\sigma_X \sigma_Y$ accomplishes not only that objective but also scales the quotient to be a number between $-1$ and $+1$.


Definition 11.4.1. Let $X$ and $Y$ be any two random variables. The correlation coefficient of $X$ and $Y$, denoted $\rho(X, Y)$, is given by

$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \operatorname{Cov}(X^*, Y^*)$$

where $X^* = (X - \mu_X)/\sigma_X$ and $Y^* = (Y - \mu_Y)/\sigma_Y$.


Theorem 11.4.1

For any two random variables $X$ and $Y$,

a. $|\rho(X, Y)| \le 1$.
b. $|\rho(X, Y)| = 1$ if and only if $Y = aX + b$ for some constants $a$ and $b$ (except possibly on a set of probability zero).


Proof Following the notation of Definition 11.4.1, let $X^*$ and $Y^*$ denote the standardized transformations of $X$ and $Y$. Then

$$\begin{aligned}
0 \le \operatorname{Var}(X^* \pm Y^*) &= \operatorname{Var}(X^*) \pm 2\operatorname{Cov}(X^*, Y^*) + \operatorname{Var}(Y^*) \\
&= 1 \pm 2\rho(X, Y) + 1 \\
&= 2[1 \pm \rho(X, Y)]
\end{aligned}$$

But $1 \pm \rho(X, Y) \ge 0$ implies that $|\rho(X, Y)| \le 1$, and part (a) of the theorem is proved.

Next, suppose that $\rho(X, Y) = 1$. Then $\operatorname{Var}(X^* - Y^*) = 0$; however, a random variable with zero variance is constant, except possibly on a set of probability zero. From the constancy of $X^* - Y^*$, it readily follows that $Y$ is a linear function of $X$. The case for $\rho(X, Y) = -1$ is similar.

The converse of part (b) is left as an exercise.
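
Definition 11.4.1 and the bounds in Theorem 11.4.1 can be illustrated with a discrete joint pdf. The sketch below is only an illustration: the function name `correlation` and both joint pdfs are hypothetical, chosen so that one places all its probability on the line $Y = 2X + 1$ and the other does not.

```python
import math

def correlation(joint_pdf):
    """rho(X, Y) for a discrete joint pdf given as {(x, y): probability}."""
    ex = sum(p * x for (x, y), p in joint_pdf.items())
    ey = sum(p * y for (x, y), p in joint_pdf.items())
    exy = sum(p * x * y for (x, y), p in joint_pdf.items())
    var_x = sum(p * (x - ex) ** 2 for (x, y), p in joint_pdf.items())
    var_y = sum(p * (y - ey) ** 2 for (x, y), p in joint_pdf.items())
    return (exy - ex * ey) / math.sqrt(var_x * var_y)

# Hypothetical pdf with all probability on the line Y = 2X + 1
on_a_line = {(x, 2 * x + 1): 0.25 for x in (1, 2, 3, 4)}

# Hypothetical pdf where Y tends to increase with X but is not a function of X
scattered = {(1, 1): 0.4, (1, 2): 0.1, (2, 1): 0.1, (2, 2): 0.4}

rho_line = correlation(on_a_line)
rho_scattered = correlation(scattered)
print(f"rho on a line: {rho_line:.3f}   rho scattered: {rho_scattered:.3f}")
```

Replacing $2X + 1$ with a decreasing linear function moves the first value to the other endpoint, $-1$.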


Questions


11.4.1. Let $X$ and $Y$ have the joint pdf

$$f_{X,Y}(x, y) = \begin{cases} \dfrac{x + 2y}{22}, & \text{for } (x, y) = (1, 1), (1, 3), (2, 1), (2, 3) \\ 0, & \text{elsewhere} \end{cases}$$

Find $\operatorname{Cov}(X, Y)$ and $\rho(X, Y)$.


11.4.2. Suppose that $X$ and $Y$ have the joint pdf

$$f_{X,Y}(x, y) = x + y, \quad 0 < x < 1,\ 0 < y < 1$$

Find $\rho(X, Y)$.


11.4.3. If the random variables $X$ and $Y$ have the joint pdf

$$f_{X,Y}(x, y) = \begin{cases} 8xy, & 0 \le y \le x \le 1 \\ 0, & \text{otherwise} \end{cases}$$

show that $\operatorname{Cov}(X, Y) = \frac{8}{450}$. Calculate $\rho(X, Y)$.


11.4.4. Suppose that $X$ and $Y$ are discrete random variables with the joint pdf

(x, y)    f_{X,Y}(x, y)
(1, 2)    1/2
(1, 3)    1/4
(2, 1)    1/8
(2, 4)    1/8

Find the correlation coefficient between $X$ and $Y$.


11.4.5. Prove that $\rho(a + bX, c + dY) = \rho(X, Y)$ for constants $a$, $b$, $c$, and $d$ where $b$ and $d$ are positive. Note that this result allows for a change of scale to one convenient for computation.


11.4.6. Let the random variable $X$ take on the values $1, 2, \ldots, n$, each with probability $1/n$. Define $Y$ to be $X^2$. Find $\rho(X, Y)$ and $\lim_{n \to \infty} \rho(X, Y)$.


11.4.7. (a) For random variables $X$ and $Y$, show that

$$\operatorname{Cov}(X + Y, X - Y) = \operatorname{Var}(X) - \operatorname{Var}(Y)$$

(b) Suppose that $\operatorname{Cov}(X, Y) = 0$. Prove that

$$\rho(X + Y, X - Y) = \frac{\operatorname{Var}(X) - \operatorname{Var}(Y)}{\operatorname{Var}(X) + \operatorname{Var}(Y)}$$


Estimating p(X, Y): The Sample Correlation Coefficient 


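We conclude this section with an estimation problem. Suppose the correlation coefficient between $X$ and $Y$ is unknown, but we have some relevant information about its value in the form of $n$ measurements $(X_1, Y_1), (X_2, Y_2), \ldots,$ and $(X_n, Y_n)$. How can we use those data to estimate $\rho(X, Y)$?

Since the correlation coefficient can be written in terms of various theoretical moments,

$$\rho(X, Y) = \frac{E(XY) - E(X)E(Y)}{\sqrt{\operatorname{Var}(X)}\sqrt{\operatorname{Var}(Y)}}$$

it would seem reasonable to estimate each component of $\rho(X, Y)$ with its corresponding sample moment. That is, let $\bar{X}$ and $\bar{Y}$ approximate $E(X)$ and $E(Y)$, replace $E(XY)$ with

$$\frac{1}{n}\sum_{i=1}^{n} X_i Y_i$$

and substitute

$$\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2 \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

for $\operatorname{Var}(X)$ and $\operatorname{Var}(Y)$, respectively.

We define the sample correlation coefficient, then, to be the ratio

$$R = \frac{\dfrac{1}{n}\sum_{i=1}^{n} X_i Y_i - \bar{X}\bar{Y}}{\sqrt{\dfrac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2}\ \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \qquad (11.4.1)$$

or, equivalently,

$$R = \frac{n\sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{\sqrt{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}\ \sqrt{n\sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2}} \qquad (11.4.2)$$

(Sometimes $R$ is referred to as the Pearson product-moment correlation coefficient, in honor of the eminent British statistician Karl Pearson.)


Questions


11.4.8. Derive Equation 11.4.2 from Equation 11.4.1.


11.4.9. Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be a set of measurements whose sample correlation coefficient is $r$. Show that

$$r = \hat{\beta}_1 \cdot \frac{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}}{\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$$

where $\hat{\beta}_1$ is the maximum likelihood estimate for the slope.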
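
The two expressions for the sample correlation coefficient, Equations 11.4.1 and 11.4.2, can be compared numerically. The sketch below does so on a small hypothetical data set (the five $(x, y)$ pairs are made up for illustration, as are the function names).

```python
import math

def r_from_sample_moments(x, y):
    """Equation 11.4.1: ratio of sample moments."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    numerator = sum(xi * yi for xi, yi in zip(x, y)) / n - xbar * ybar
    var_x = sum((xi - xbar) ** 2 for xi in x) / n
    var_y = sum((yi - ybar) ** 2 for yi in y) / n
    return numerator / (math.sqrt(var_x) * math.sqrt(var_y))

def r_from_sums(x, y):
    """Equation 11.4.2: computing form based on raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    sum_yy = sum(yi * yi for yi in y)
    return (n * sum_xy - sum_x * sum_y) / (
        math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(n * sum_yy - sum_y ** 2))

# Hypothetical measurements, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r_moments = r_from_sample_moments(xs, ys)
r_sums = r_from_sums(xs, ys)
print(f"Equation 11.4.1: {r_moments:.4f}   Equation 11.4.2: {r_sums:.4f}")
```

The second form avoids computing the sample means first, which is why it is the one usually used for hand calculation.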


Interpreting R 


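The properties cited for $\rho(X, Y)$ in Theorem 11.4.1 are not sufficient to provide a useful interpretation of $R$. What does it mean, for example, to say that the sample correlation coefficient is 0.73, or 0.55, or $-0.24$? One way to answer such a question focuses on the square of $R$, rather than on $R$ itself.

We know from Equation 11.3.3 that

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 - \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Using the relationship between $\hat{\beta}_1$ and $r$ in Question 11.4.9, together with the fact that $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n$, we can write

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 - r^2 \sum_{i=1}^{n} (y_i - \bar{y})^2$$

which reduces to

$$r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (11.4.3)$$
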
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Equation 11.4.3 has a nice, simple interpretation. Notice that 


n 

1. >> (y; — y)* represents the total variability in the dependent variable —that is, 
i=l 
the extent to which the y,’s are not all the same. 
n A A 

2. >°(v;— Bo —fixi)* represents the variation in the y;’s not explained (or 
i=l 


accounted for) by the linear regression with x. 


ds L n a 2 — 2 ' 
3. >> 0; -— 9») — ©) Gi — Bo— fix) represents the variation in the y;’s that is 

i=l i=l 

explained by the linear regression with x. 


Therefore, $r^2$ is the proportion of the total variation in the $y_i$'s that can be attributed
to the linear relationship with x. So, if r = 0.60, we can say that 36% of the variation
in Y is explained by the linear regression with X (and that 64% is associated with
other factors).


Comment The quantity $r^2$ is sometimes called the coefficient of determination.
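The interpretation of $r^2$ as a proportion of explained variation can be checked numerically. The following is a minimal Python sketch, not part of the original text: the function names and the five data points are hypothetical, chosen only for illustration. It computes r from the computing formula in Equation 11.4.2 and, separately, the right-hand side of Equation 11.4.3 from a least squares fit.

```python
import math

def sample_correlation(xs, ys):
    """Sample correlation coefficient r via the computing formula (Equation 11.4.2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    )

def explained_proportion(xs, ys):
    """Right-hand side of Equation 11.4.3: (total - unexplained) / total."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    # Least squares slope and intercept.
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
        (x - xbar) ** 2 for x in xs
    )
    b0 = ybar - b1 * xbar
    total = sum((y - ybar) ** 2 for y in ys)
    unexplained = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return (total - unexplained) / total

# Hypothetical data, used only to illustrate the identity.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = sample_correlation(xs, ys)
print(r ** 2, explained_proportion(xs, ys))
```

Up to floating-point rounding, the two printed numbers should agree, which is exactly what Equation 11.4.3 asserts.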


Case Study 11.4.1 


The Scholastic Aptitude Test (SAT) is widely used by colleges and universities 
to help choose their incoming classes. It was never designed to measure the 
quality of education provided by secondary schools, but critics and supporters 
alike seem increasingly intent on forcing it into that role. The problem is that 
average SAT scores associated with schools or districts or states reflect a vari- 
ety of factors, some of which have little or nothing to do with the quality of 
instruction that students are receiving. 

Table 11.4.1 shows one testing period’s average SAT scores (Y), by state, 
as a function of participation rate (X), where the SAT score is the sum of the 
Critical Reading, Math, and Writing subtest scores. As Figure 11.4.1 suggests, 
there appears to be a strong dependency between the two measurements—as 
a State’s participation rate goes down, its average SAT score goes up. In South 
Dakota, for example, only 3% of the students eligible to take the test actually 
did; in New York, the participation rate was a dramatically larger 84%. The 
average SAT score in New York was 1473; in South Dakota the average score 
of 1766 was 20% higher. 

A good way to quantify the overall relationship between test scores and 
participation rates is to calculate the data’s sample correlation coefficient, r. 

From Table 11.4.1, we can calculate the sums necessary to evaluate
Equation 11.4.2:

$$\sum_{i=1}^{51} x_i = 1{,}891 \qquad \sum_{i=1}^{51} y_i = 81{,}396$$

$$\sum_{i=1}^{51} x_i^2 = 114{,}983 \qquad \sum_{i=1}^{51} y_i^2 = 130{,}597{,}738$$

$$\sum_{i=1}^{51} x_i y_i = 2{,}863{,}056$$

Table 11.4.1 
Participation Average Participation Average 
State Rate, x SAT Score, y State Rate, x SAT Score, y 
AL 8% 1676 MT 24% 1612 
AK 45% 1533 NE 5% 1733 
AZ 26% 1538 NV 40% 1482 
AR 5% 1701 NH 74% 1555 
CA 48% 1512 NJ 76% 1504 
CO 21% 1687 NM 12% 1645 
CT 83% 1535 NY 84% 1473 
DE 70% 1487 NC 63% 1489 
DC 84% 1390 ND 3% 1766 
FL 54% 1474 OH 24% 1599 
GA 70% 1466 OK 6% 1701 
HI 58% 1453 OR 53% 1552 
ID 18% 1597 PA 71% 1478
IL 7% 1762 RI 66% 1486
IN 62% 1485 SC 61% 1461
IA 3% 1797 SD 3% 1766
KS 7% 1733 TN 11% 1707
KY 8% 1692 TX 50% 1473
LA 7% 1688 UT 6% 1661
ME 87% 1396 VT 64% 1549
MD 69% 1498 VA 68% 1522
MA 83% 1552 WA 52% 1568
MI 6% 1751 WV 19% 1511
MN 8% 1784 WI 5% 1768
MS 3% 1696 WY 6% 1677 
MO 5% 1775 


Source: professionals.collegeboard.com/profdownload/cbs-08-Page-3-Table-3.pdf.


Substituting the sums into the formula for r, then, shows that the sample
correlation coefficient is −0.881:

$$r = \frac{51(2{,}863{,}056) - (1{,}891)(81{,}396)}{\sqrt{51(114{,}983) - (1{,}891)^2}\,\sqrt{51(130{,}597{,}738) - (81{,}396)^2}} = -0.881$$


Since $r^2 = (-0.881)^2 = 0.776$, we can say that 77.6% of the variability in SAT
scores from state to state can be attributed to the linear relationship between
test scores and participation rates.
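The arithmetic in Case Study 11.4.1 is easy to reproduce. The short Python sketch below is an illustration only (the helper name `r_from_sums` is hypothetical); it evaluates Equation 11.4.2 directly from the five sums quoted for the SAT data.

```python
import math

def r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Evaluate Equation 11.4.2 from the five summary sums."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(
        n * sum_y2 - sum_y ** 2
    )
    return numerator / denominator

# Sums quoted in Case Study 11.4.1 (n = 51: fifty states plus DC).
r_sat = r_from_sums(
    n=51,
    sum_x=1891,
    sum_y=81396,
    sum_x2=114983,
    sum_y2=130597738,
    sum_xy=2863056,
)

print(round(r_sat, 3), round(r_sat ** 2, 3))
```

The same helper can be reused for any of the questions in this section that supply the five sums.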


About the Data The magnitude of $r^2$ for these data should be a clear warning that
comparing average SATs at face value from state to state or school system to school
system is largely meaningless. It would make more sense to examine the residuals
associated with $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. States with particularly large positive values for $y - \hat{y}$
may be doing something that other states might be well advised to copy.


Questions 


11.4.10. In Case Study 11.3.1, how much of the variability 
in CHD mortality is explained by cigarette consumption? 


11.4.11. Some baseball fans believe that the number of
home runs a team hits is markedly affected by the altitude
of the club’s home park. The rationale is that the
air is thinner at the higher altitudes, and balls would be
expected to travel farther. The following table shows the
altitudes (X) of American League ballparks and the number
of home runs (Y) that each team hit during a recent
season (172). Calculate the sample correlation coefficient,
r, using the sums below. What would you conclude?

$$\sum_{i=1}^{12} x_i = 4{,}936 \qquad \sum_{i=1}^{12} x_i^2 = 3{,}071{,}116$$

$$\sum_{i=1}^{12} y_i = 1{,}175 \qquad \sum_{i=1}^{12} y_i^2 = 123{,}349$$

$$\sum_{i=1}^{12} x_i y_i = 480{,}565$$

Club Altitude, x Number of Home Runs, y
Cleveland 660 138
Milwaukee 635 81
Detroit 585 135
New York 55 90
Boston 21 120
Baltimore 20 84
Minnesota 815 106
Kansas City 750 57
Chicago 595 109
Texas 435 74
California 340 61
Oakland 25 120


11.4.12. The following table shows U.S. corn supplies (in 
millions of bushels) and corn prices (dollars per bushel 
rounded to the nearest $0.10) for the years 1999 through 
2008. Calculate the sample correlation coefficient, r. The 
sums for the data in the table are: 


$$\sum_{i=1}^{10} x_i = 123.1 \qquad \sum_{i=1}^{10} y_i = 25.80$$

$$\sum_{i=1}^{10} x_i^2 = 1529.63 \qquad \sum_{i=1}^{10} y_i^2 = 74.00$$

$$\sum_{i=1}^{10} x_i y_i = 325.08$$

Year Supply, x Price, y 
1999 11.2 $1.70 
2000 11.8 1.80 
2001 11.5 2.00 
2002 10.6 2.40 
2003 11.2 2.50 
2004 12.8 2.10 
2005 13.1 2.00 
2006 12.6 3.00 
2007 14.5 4.20 
2008 13.8 4.10 


Source: USDA WASDE report 1.12.10, www.agmanager.info.

11.4.13. The extent to which stress is a contributing factor
to the severity of chronic illnesses was the focus of the
study summarized in the following table (208). Seventeen 
conditions were compared on a Seriousness of Illness Rat- 
ing Scale (SIRS). Patients with each of these conditions 
were asked to fill out a Schedule of Recent Experience 
(SRE) questionnaire. Higher scores on the SRE reflect 
presumably greater levels of stress. How much of the vari- 
ation in the SIRS values can be attributed to the linear 
regression with SRE? 


Admitting Diagnosis Average SRE,x SIRS, y 
Dandruff 26 21 
Varicose veins 130 173 
Psoriasis 317 174 
Eczema 231 204 
Anemia 325 312 
Hyperthyroidism 816 393 
Gallstones 563 454 
Arthritis 312 468 
Peptic ulcer 603 500 
High blood pressure 405 520 
Diabetes 599 621 
Emphysema 357 636 
Alcoholism 688 688 
Cirrhosis 443 733 
Schizophrenia 609 776 
Heart failure 772 824 
Cancer 777 1020


Use the following sums:

$$\sum_{i=1}^{17} x_i = 7{,}973 \qquad \sum_{i=1}^{17} y_i = 8{,}517$$

$$\sum_{i=1}^{17} x_i^2 = 4{,}611{,}291 \qquad \sum_{i=1}^{17} y_i^2 = 5{,}421{,}917$$

$$\sum_{i=1}^{17} x_i y_i = 4{,}759{,}470$$

11.4.14. Among the many strategies that investors use 
to try to predict trends in the stock market is the “early 


warning” system, which is based on the premise that what 
the market does in the first week in January is indicative 
of what it will do over the next twelve months. Listed 
in the following table for the eighteen years from 1991 
through 2008 are x, the percentage change in the Dow 
Jones Industrial Average for the first week in January, and 
y, the percentage change for the entire year. Quantify the 
strength of the linear relationship between X and Y. Use 
the following sums: 


$$\sum_{i=1}^{18} x_i = -0.9 \qquad \sum_{i=1}^{18} y_i = 160.2$$

$$\sum_{i=1}^{18} x_i^2 = 92.63 \qquad \sum_{i=1}^{18} y_i^2 = 6437.68$$

$$\sum_{i=1}^{18} x_i y_i = 221.37$$

Year   % Change for First Week in January, x   % Change for Year, y
1991 -2.6 24.8
1992 -0.1 1.6
1993 -1.5 15.7
1994 1.7 2.1
1995 0.9 33.5
1996 1.2 28.2
1997 2.4 21.7
1998 -5.1 21.1
1999 4.8 25.5
2000 0.2 -6.2
2001 -1.2 -6.0
2002 -2.7 -16.2
2003 2.1 21.0
2004 0.5 1.9
2005 -1.7 -0.6
2006 2.2 16.3
2007 -0.5 7.2
2008 -1.5 -31.4


Source: finance.yahoo.com/q/hp?s=%5EDJI.


11.5 The Bivariate Normal Distribution 


The singular importance of the normal distribution in univariate inference procedures
should, by now, be abundantly clear. In dealing with problems that involve
two random variables (for example, the calculation of $\rho(X, Y)$), it should come as
no surprise that the most frequently encountered joint pdf, $f_{X,Y}(x, y)$, is a bivariate
version of the normal curve. Our objectives in this section are twofold: (1) to deduce
the form of the bivariate normal from basic principles and (2) to identify the particular
properties of that pdf that pertain to the problem of assessing the nature of the
dependence between X and Y.

Generalizing the Univariate Normal pdf


At this point, we know many things about the univariate normal pdf,

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2}, \quad -\infty < y < \infty$$

Sections upon sections have been devoted to estimating and testing its parameters,
studying its transformations, and learning about its role as an approximation to the
distribution of sums and averages. What has not been discussed is the generalization
of $f_Y(y)$ itself, to a bivariate, trivariate, or multivariate pdf.

Given the mathematical complexities inherent in the univariate normal pdf, it
should come as no surprise that its extension to higher dimensions is not a simple
matter. In the bivariate case, for example, which is the only generalization we will
consider, $f_{X,Y}(x, y)$ has five different parameters and its functional form is decidedly
unpleasant.

We will begin by “constructing” a bivariate normal pdf, $f_{X,Y}(x, y)$, using properties
suggested by what we already know holds true for the univariate normal, $f_Y(y)$.
As a first condition to impose, it seems reasonable to require that the marginal
pdfs associated with $f_{X,Y}(x, y)$ be univariate normal densities. It will be sufficient
to consider the case where the two marginals are standard normals.

If X and Y are independent standard normal random variables,

$$f_{X,Y}(x, y) = \frac{1}{2\pi}\, e^{-\frac{1}{2}(x^2 + y^2)}, \quad -\infty < x < \infty, \; -\infty < y < \infty \tag{11.5.1}$$

Notice that the simplest extension of $f_{X,Y}(x, y)$ in Equation 11.5.1 is to replace
$-\frac{1}{2}(x^2 + y^2)$ with $-\frac{1}{2}c(x^2 + uxy + y^2)$, or, equivalently, with $-\frac{1}{2}c(x^2 - 2vxy + y^2)$,
where c and v are constants. The desired joint pdf, then, would have the general
form

$$f_{X,Y}(x, y) = K e^{-\frac{1}{2}c(x^2 - 2vxy + y^2)} \tag{11.5.2}$$

where K is the constant that makes the double integral of $f_{X,Y}(x, y)$ from $-\infty$ to $\infty$
equal to 1.

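Now, what must be true of K, c, and v if the marginal pdfs based on $f_{X,Y}(x, y)$
are to be standard normals? Note, first, that completing the square in the exponent
makes

$$x^2 - 2vxy + y^2 = x^2 - v^2x^2 + (y^2 - 2vxy + v^2x^2) = (1 - v^2)x^2 + (y - vx)^2$$

so

$$f_{X,Y}(x, y) = K e^{-\frac{1}{2}c(1 - v^2)x^2}\, e^{-\frac{1}{2}c(y - vx)^2}$$

The exponents, though, must be negative, which implies that $1 - v^2 > 0$, or, equivalently,
$|v| < 1$.

To find K, we start by calculating

$$\begin{aligned}
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac{1}{2}c(1 - v^2)x^2}\, e^{-\frac{1}{2}c(y - vx)^2}\, dy\, dx
&= \int_{-\infty}^{\infty} e^{-\frac{1}{2}c(1 - v^2)x^2} \left( \int_{-\infty}^{\infty} e^{-\frac{1}{2}c(y - vx)^2}\, dy \right) dx \\
&= \int_{-\infty}^{\infty} e^{-\frac{1}{2}c(1 - v^2)x^2} \left( \frac{\sqrt{2\pi}}{\sqrt{c}} \right) dx \\
&= \frac{\sqrt{2\pi}}{\sqrt{c}} \cdot \frac{\sqrt{2\pi}}{\sqrt{c}\,\sqrt{1 - v^2}} \\
&= \frac{2\pi}{c\sqrt{1 - v^2}}
\end{aligned}$$

It follows that

$$K = \frac{c\sqrt{1 - v^2}}{2\pi}$$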

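The constant c can be any positive value, but a convenient choice proves to be
$c = 1/(1 - v^2)$. Substituting K and c, then, into Equation 11.5.2 gives

$$\begin{aligned}
f_{X,Y}(x, y) &= \frac{1}{2\pi\sqrt{1 - v^2}}\, e^{-\frac{1}{2}[1/(1 - v^2)](x^2 - 2vxy + y^2)} \\
&= \frac{1}{2\pi\sqrt{1 - v^2}}\, e^{-\frac{1}{2}x^2}\, e^{-\frac{1}{2}[1/(1 - v^2)](y - vx)^2}
\end{aligned} \tag{11.5.3}$$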


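Recall that our choice of the form of $f_{X,Y}(x, y)$ was predicated on a wish for the
marginal pdfs to be normal. A simple integration shows that to be the case:

$$\begin{aligned}
f_X(x) &= \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy \\
&= \frac{1}{2\pi\sqrt{1 - v^2}}\, e^{-\frac{1}{2}x^2} \int_{-\infty}^{\infty} e^{-\frac{1}{2}[1/(1 - v^2)](y - vx)^2}\, dy \\
&= \frac{1}{2\pi\sqrt{1 - v^2}}\, e^{-\frac{1}{2}x^2}\, \sqrt{2\pi}\,\sqrt{1 - v^2} \\
&= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}
\end{aligned}$$

Since $f_{X,Y}(x, y)$ is symmetric in x and y, $f_Y(y)$ is also the standard normal.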


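The constant v is actually the correlation coefficient between X and Y. Since
$E(X) = E(Y) = 0$ and $\sigma_X = \sigma_Y = 1$,

$$\begin{aligned}
\rho(X, Y) = E(XY) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f_{X,Y}(x, y)\, dx\, dy \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x e^{-\frac{1}{2}x^2} \left[ \frac{1}{\sqrt{2\pi}\sqrt{1 - v^2}} \int_{-\infty}^{\infty} y\, e^{-\frac{1}{2}[1/(1 - v^2)](y - vx)^2}\, dy \right] dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x e^{-\frac{1}{2}x^2} \cdot vx\, dx \quad \text{(why?)} \\
&= v \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^2 e^{-\frac{1}{2}x^2}\, dx = v\,\mathrm{Var}(X) = v
\end{aligned}$$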
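Both conclusions reached so far (that the constant K makes Equation 11.5.3 integrate to 1, and that v is the correlation between X and Y) can be checked by brute-force numerical integration. The Python sketch below is only an illustration under stated assumptions: the function names, the value v = 0.6, the integration limits of ±8, and the grid spacing are all arbitrary choices, and the midpoint rule is a crude stand-in for the exact integrals.

```python
import math

def f_xy(x, y, v):
    """Equation 11.5.3: bivariate normal pdf with standard normal marginals."""
    k = 1.0 / (2.0 * math.pi * math.sqrt(1.0 - v * v))
    return k * math.exp(-0.5 * (x * x - 2.0 * v * x * y + y * y) / (1.0 - v * v))

def grid_moments(v, half_width=8.0, step=0.05):
    """Midpoint-rule approximations to the total mass and to E(XY)."""
    m = int(round(2 * half_width / step))
    points = [-half_width + (i + 0.5) * step for i in range(m)]
    mass = 0.0
    exy = 0.0
    for x in points:
        for y in points:
            p = f_xy(x, y, v) * step * step  # probability of one grid cell
            mass += p
            exy += x * y * p
    return mass, exy

mass, exy = grid_moments(0.6)
print(mass, exy)
```

The first printed value should be very close to 1 and the second very close to the chosen v, in line with the derivation above.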


Finally, we can replace x with $(x - \mu_X)/\sigma_X$ and y with $(y - \mu_Y)/\sigma_Y$. Doing
so requires that the original pdf be multiplied by the derivative of both the
X-transformation and the Y-transformation, that is, by $\dfrac{1}{\sigma_X \sigma_Y}$ [see (102)].

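Definition 11.5.1. Let X and Y be random variables with joint pdf

$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho^2}} \cdot \exp\left\{ -\frac{1}{2}\left(\frac{1}{1 - \rho^2}\right) \left[ \frac{(x - \mu_X)^2}{\sigma_X^2} - 2\rho\,\frac{x - \mu_X}{\sigma_X} \cdot \frac{y - \mu_Y}{\sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right] \right\}$$

for all x and y. Then X and Y are said to have the bivariate normal distribution
(with parameters $\mu_X$, $\sigma_X^2$, $\mu_Y$, $\sigma_Y^2$, and $\rho$).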


Comment For bivariate normal densities, $\rho(X, Y) = 0$ implies that X and Y are
independent, a result not true in general.
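The Comment can be seen directly from the formula: when $\rho = 0$ the cross-product term vanishes and the joint pdf factors into the product of two univariate normal pdfs. The following Python sketch illustrates this at a single point. The function names and all parameter values are hypothetical, picked only for the demonstration.

```python
import math

def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Joint pdf of Definition 11.5.1 (sigma_x and sigma_y are standard deviations)."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * quad) / norm

def normal_pdf(t, mu, sigma):
    """Univariate normal pdf with mean mu and standard deviation sigma."""
    z = (t - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

# Hypothetical parameter values and evaluation point, with rho = 0.
joint = bivariate_normal_pdf(1.0, 2.5, mu_x=0.0, mu_y=2.0,
                             sigma_x=1.5, sigma_y=0.8, rho=0.0)
product = normal_pdf(1.0, 0.0, 1.5) * normal_pdf(2.5, 2.0, 0.8)
print(joint, product)
```

With any nonzero value of rho the two numbers will generally differ, since the joint pdf no longer factors.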


Properties of the Bivariate Normal Distribution 


Francis Galton, the renowned British biologist and scientist, perhaps more than 
any other person was responsible for launching regression analysis as a worth- 
while field of statistical inquiry. Galton was a redoubtable data analyst whose keen 
insight enabled him to intuit much of the basic mathematical structure that we now 
associate with correlation and regression. 

One of his more famous endeavors (58) was an examination of the relation- 
ship between parents’ heights (X) and their adult children’s heights (Y). Those 
particular variables have a bivariate normal distribution, the mathematical prop- 
erties of which Galton knew nothing. Just by looking at cross-tabulations of X and 
Y, though, Galton postulated that (1) the marginal distributions of X and Y are 
both normal, (2) E(Y |x) is a linear function of x, and (3) Var(Y |x) is constant 
with x. As Theorem 11.5.1 shows, all of his empirically based deductions proved to 
be true. 


Theorem 11.5.1. Suppose that X and Y are random variables having the bivariate
normal distribution given in Definition 11.5.1. Then

a. $f_X(x)$ is a normal pdf with mean $\mu_X$ and variance $\sigma_X^2$; $f_Y(y)$ is a normal pdf with
mean $\mu_Y$ and variance $\sigma_Y^2$.
b. $\rho$ is the correlation coefficient between X and Y.
c. $E(Y \mid x) = \mu_Y + \dfrac{\rho\sigma_Y}{\sigma_X}(x - \mu_X)$.
d. $\mathrm{Var}(Y \mid x) = (1 - \rho^2)\sigma_Y^2$.
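Parts (c) and (d) translate into a two-line computation. The sketch below is a minimal Python illustration (the function name is hypothetical, and the inputs are variances rather than standard deviations, to match the parameterization of Definition 11.5.1). The example call uses the standardized case, for which the theorem reduces to $E(Y \mid x) = \rho x$ and $\mathrm{Var}(Y \mid x) = 1 - \rho^2$.

```python
import math

def conditional_y_given_x(x, mu_x, mu_y, var_x, var_y, rho):
    """Mean and variance of Y given X = x, per parts (c) and (d) of Theorem 11.5.1."""
    slope = rho * math.sqrt(var_y) / math.sqrt(var_x)
    mean = mu_y + slope * (x - mu_x)
    variance = (1.0 - rho * rho) * var_y
    return mean, variance

# Standardized case (means 0, variances 1) with an illustrative rho = 0.4.
mean, variance = conditional_y_given_x(1.5, mu_x=0.0, mu_y=0.0,
                                       var_x=1.0, var_y=1.0, rho=0.4)
print(mean, variance)
```

Note that the conditional variance does not depend on x, which is the third of Galton's empirical observations.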


Proof We have already established (a) and (b). Properties (c) and (d) will be examined
for the special case $\mu_X = \mu_Y = 0$ and $\sigma_X = \sigma_Y = 1$. The extension to arbitrary
$\mu_X$, $\mu_Y$, $\sigma_X$, and $\sigma_Y$ is straightforward.

First, note that

$$\begin{aligned}
f_{Y|x}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)}
&= \frac{\dfrac{1}{2\pi\sqrt{1 - \rho^2}}\, e^{-\frac{1}{2}x^2}\, e^{-\frac{1}{2}[1/(1 - \rho^2)](y - \rho x)^2}}{\dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}} \\
&= \frac{1}{\sqrt{2\pi}\sqrt{1 - \rho^2}}\, e^{-\frac{1}{2}[1/(1 - \rho^2)](y - \rho x)^2}
\end{aligned} \tag{11.5.4}$$

By inspection, we see that Equation 11.5.4 is the pdf of a normal random variable
with mean $\rho x$ and variance $1 - \rho^2$. Therefore, $E(Y \mid x) = \rho x$ and $\mathrm{Var}(Y \mid x) = 1 - \rho^2$.
Replacing y with $(y - \mu_Y)/\sigma_Y$ and x with $(x - \mu_X)/\sigma_X$ gives the desired results.


Comment The term regression line derives from a consequence of part (c) of Theorem
11.5.1. Suppose we make the simplifying assumption that $\mu_X = \mu_Y = \mu$ and
$\sigma_X = \sigma_Y$. Then part (c) reduces to

$$E(Y \mid x) - \mu = \rho(X, Y)(x - \mu)$$

But recall that $|\rho(X, Y)| \le 1$ and, in this case, $0 < \rho(X, Y) < 1$. Here, the positive
sign of $\rho(X, Y)$ tells us that, on the average, tall parents have tall children. However,
$\rho(X, Y) < 1$ means (again, on the average) that the children’s heights are closer
to the mean than are the parents’. Galton called this phenomenon “regression to
mediocrity.”


Questions


11.5.1. Suppose that X and Y have a bivariate normal
pdf with $\mu_X = 3$, $\mu_Y = 6$, $\sigma_X^2 = 4$, $\sigma_Y^2 = 10$, and $\rho = \frac{1}{2}$. Find
$P(5 < Y < 6\frac{1}{2})$ and $P(5 < Y < 6\frac{1}{2} \mid x = 2)$.

11.5.2. Suppose that X and Y have a bivariate normal
distribution with Var(X) = Var(Y).

(a) Show that X and $Y - \rho X$ are independent.

(b) Show that X + Y and X − Y are independent. [Hint:
See Question 11.4.7(a).]

11.5.3. Suppose that X and Y have a bivariate normal
distribution.

(a) Prove that cX + dY has a normal distribution when X
and Y are standard normal random variables.

(b) Find E(cX + dY) and Var(cX + dY) in terms of
$\mu_X$, $\mu_Y$, $\sigma_X$, $\sigma_Y$, and $\rho(X, Y)$, where X and Y are arbitrary
normal random variables.

11.5.4. Suppose that the random variables X and Y have a
bivariate normal pdf with $\mu_X = 56$, $\mu_Y = 11$, $\sigma_X^2 = 1.2$, $\sigma_Y^2 = 2.6$,
and $\rho = 0.6$. Compute $P(10 < Y < 10.5 \mid x = 55)$. Suppose
that n = 4 values were to be observed with x fixed at
55. Find $P(10.5 < \bar{Y} < 11 \mid x = 55)$.

11.5.5. If the joint pdf of the random variables X and Y is

$$f_{X,Y}(x, y) = k e^{-\frac{2}{3}\left[\frac{1}{4}x^2 - \frac{1}{2}xy + y^2\right]}$$

find E(X), E(Y), Var(X), Var(Y), $\rho(X, Y)$, and k.

11.5.6. Give conditions on a > 0, b > 0, and u so that

$$f_{X,Y}(x, y) = k e^{-(ax^2 - 2uxy + by^2)}$$

is the bivariate normal density of random variables X and
Y each having expected value 0. Also, find Var(X), Var(Y),
and $\rho(X, Y)$.


Estimating Parameters in the Bivariate Normal pdf 


The five parameters in $f_{X,Y}(x, y)$ can be estimated in the usual way with the method
of maximum likelihood. Given a random sample of size n from $f_{X,Y}(x, y)$, say
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we define $L = \prod_{i=1}^{n} f_{X,Y}(x_i, y_i)$ and take the
derivatives of ln L with respect to each of the parameters. Solved simultaneously,
the resulting five equations (each derivative set equal to 0) yield the maximum
likelihood estimators given in Theorem 11.5.2. Details of the derivation will be left
as an exercise.


Theorem 11.5.2. Given that $f_{X,Y}(x, y)$ is a bivariate normal pdf, the maximum
likelihood estimators for $\mu_X$, $\mu_Y$, $\sigma_X^2$, $\sigma_Y^2$, and $\rho$, assuming that all five are
unknown, are $\bar{X}$, $\bar{Y}$,

$$\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2, \quad \frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2,$$

and R, respectively.
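As an illustration of Theorem 11.5.2, the five estimates can be computed in a few lines. This Python sketch is not from the original text: the function name and the four data pairs are hypothetical. Note that the variance estimates divide by n, not n − 1, and that R is computed as in Equation 11.4.1.

```python
import math

def bivariate_normal_mle(xs, ys):
    """Maximum likelihood estimates of (mu_X, mu_Y, sigma_X^2, sigma_Y^2, rho)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    var_x = sum((x - xbar) ** 2 for x in xs) / n  # divisor n, not n - 1
    var_y = sum((y - ybar) ** 2 for y in ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - xbar * ybar
    r = cov / (math.sqrt(var_x) * math.sqrt(var_y))  # Equation 11.4.1
    return xbar, ybar, var_x, var_y, r

# Hypothetical sample of n = 4 pairs.
estimates = bivariate_normal_mle([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0])
print(estimates)
```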


Testing $H_0$: $\rho = 0$


If X and Y have a bivariate normal distribution, testing whether the two variables
are independent is equivalent to testing whether their correlation coefficient, $\rho$,
equals 0 (recall the Comment following Definition 11.5.1). Two different procedures
are widely used for testing $H_0$: $\rho = 0$. One is an exact test based on the $T_{n-2}$ random
variable given in Theorem 11.5.3; the other is an approximate test based on the
standard normal distribution.


Theorem 11.5.3. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be a random sample of size n
drawn from a bivariate normal distribution, and let R be the sample correlation
coefficient. Under the null hypothesis that $\rho = 0$, the statistic

$$T_{n-2} = \frac{\sqrt{n - 2}\,R}{\sqrt{1 - R^2}}$$

has a Student t distribution with n − 2 degrees of freedom.


Proof See (49).


Example 11.5.1. Table 11.5.1 gives the mean temperature for twenty successive days
in April and the average daily butterfat content in the milk of ten cows (138). Can
we conclude that temperature and butterfat content have a nonzero correlation?

Let $\rho$ denote the true correlation coefficient between X and Y. The hypotheses
to be tested are

$$H_0: \rho = 0$$

versus

$$H_1: \rho \ne 0$$

Let $\alpha = 0.05$. Given that n = 20, the statistic

$$t = \frac{\sqrt{n - 2}\, r}{\sqrt{1 - r^2}}$$

follows a Student t distribution with 18 df (if $H_0$: $\rho = 0$ is true). That being the case,
the null hypothesis will be rejected if t is either (1) $\le -2.1009$ $(= -t_{.025,18})$ or (2) $\ge
+2.1009$ $(= t_{.025,18})$.

Table 11.5.1
Date       Temperature, x   Percent Butterfat, y
April 3    64               4.65
April 4    65               4.58
April 5    65               4.67
April 6    64               4.60
April 7    61               4.83
April 8    55               4.55
April 9    39               5.14
April 10   41               4.71
April 11   46               4.69
April 12   59               4.65
April 13   56               4.36
April 14   56               4.82
April 15   62               4.65
April 16   37               4.66
April 17   37               4.95
April 18   45               4.60
April 19   57               4.68
April 20   58               4.65
April 21   60               4.60
April 22   55               4.46


For the data in Table 11.5.1,

$$\sum_{i=1}^{20} x_i = 1{,}082 \qquad \sum_{i=1}^{20} y_i = 93.5$$

$$\sum_{i=1}^{20} x_i^2 = 60{,}304 \qquad \sum_{i=1}^{20} y_i^2 = 437.6406$$

$$\sum_{i=1}^{20} x_i y_i = 5{,}044.5$$

so

$$r = \frac{20(5{,}044.5) - (1{,}082)(93.5)}{\sqrt{20(60{,}304) - (1{,}082)^2}\,\sqrt{20(437.6406) - (93.5)^2}} = -0.453$$

Therefore,

$$t = \frac{\sqrt{n - 2}\, r}{\sqrt{1 - r^2}} = \frac{\sqrt{18}\,(-0.453)}{\sqrt{1 - (-0.453)^2}} = -2.156$$

and our conclusion is reject $H_0$; it would appear that temperature and butterfat
content are not independent.
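The calculations in Example 11.5.1 can be reproduced with a few lines of Python. This is only a sketch: the variable names are hypothetical, and the critical value 2.1009 is simply copied from the example rather than computed. Because the code carries r at full precision instead of rounding it to −0.453 first, the t ratio it prints may differ from −2.156 in the third decimal place.

```python
import math

# Sums quoted for the data in Table 11.5.1.
n = 20
sum_x, sum_y = 1082, 93.5
sum_x2, sum_y2 = 60304, 437.6406
sum_xy = 5044.5

# Sample correlation coefficient (Equation 11.4.2).
r = (n * sum_xy - sum_x * sum_y) / (
    math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
)

# Test statistic of Theorem 11.5.3.
t = math.sqrt(n - 2) * r / math.sqrt(1 - r * r)

T_CRIT = 2.1009  # t_{.025,18}, as quoted in Example 11.5.1
reject_h0 = abs(t) >= T_CRIT

print(round(r, 3), round(t, 3), reject_h0)
```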


Comment An alternate approach to testing $H_0$: $\rho = 0$ was given by Fisher (46). He
showed that the statistic

$$\frac{1}{2} \ln \frac{1 + R}{1 - R}$$

is asymptotically normal with mean $\frac{1}{2} \ln[(1 + \rho)/(1 - \rho)]$ and variance approximately
1/(n − 3). Fisher’s formulation makes it relatively easy to determine the power of a
correlation test, a computation that would be much more difficult if the inference
had to be based on $\sqrt{n - 2}\,R/\sqrt{1 - R^2}$.
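Fisher's approximation leads to a simple large-sample Z test. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, the standardization (subtracting the null mean and multiplying by $\sqrt{n - 3}$) follows directly from the mean and variance given in the Comment, and the cutoff 1.96 is the usual two-sided standard normal critical value for $\alpha = 0.05$. It is applied to the value r = −0.453 from Example 11.5.1.

```python
import math

def fisher_z_statistic(r, n, rho0=0.0):
    """Approximate standard normal statistic based on Fisher's transformation."""
    z_r = 0.5 * math.log((1.0 + r) / (1.0 - r))
    z_0 = 0.5 * math.log((1.0 + rho0) / (1.0 - rho0))
    # Variance of the transformed statistic is approximately 1 / (n - 3).
    return (z_r - z_0) * math.sqrt(n - 3)

z = fisher_z_statistic(-0.453, 20)
reject_h0 = abs(z) >= 1.96  # two-sided test at the 0.05 level

print(round(z, 3), reject_h0)
```

For these data the approximate test and the exact t test of Theorem 11.5.3 lead to the same decision, although with n as small as 20 the agreement should not be taken for granted.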


Questions 


11.5.7. What would the conclusion be for the test of
Example 11.5.1 if $\alpha = 0.01$?


11.5.8. In a study of heart disease (73), the weight (in 
pounds) and the blood cholesterol (in mg/dl) of four- 
teen men without a history of coronary incidents were 
recorded. At the a = 0.05 level, can we conclude from 
these data that the two variables are independent? 


Subject Weight,x Cholesterol, y 
1 168 135 
2 175 403 
3 173 294 
4 158 312 
5 154 311 
6 214 222 
7 176 302 
8 262 269 
9 181 311 

10 143 286 
11 140 403 
12 187 244 
13 163 353 
14 164 252 


The data in the table give the following sums: 


$$\sum_{i=1}^{14} x_i = 2{,}458 \qquad \sum_{i=1}^{14} y_i = 4{,}097$$

$$\sum_{i=1}^{14} x_i^2 = 444{,}118 \qquad \sum_{i=1}^{14} y_i^2 = 1{,}262{,}559$$

$$\sum_{i=1}^{14} x_i y_i = 710{,}499$$


11.5.9. Recall the baseball data in Question 11.4.11. Test 
whether home run frequency and home park altitude are 
independent. Let a = 0.05. 


11.5.10. Test $H_0$: $\rho = 0$ versus $H_1$: $\rho \ne 0$ for the SRE/SIRS
data described in Question 11.4.13. Let 0.01 be the level of
significance.


11.5.11. The National Collegiate Athletic Association has
had a long-standing concern about the graduation rate
of athletes. Under the urging of the Association, some
prominent athletic programs increased the funds for tutoring
athletes. The table below gives the amount spent (in
millions of dollars) and the resulting percentage of athletes
graduating in 2007. Test $H_0$: $\rho = 0$ versus $H_1$: $\rho > 0$ at
the 0.10 level of significance.


Money Spent on Graduation 
University Athletes Tutoring,x Rate 2007, y 
Minnesota 1.61 72 
Kansas 1.61 70 
Florida 1.67 87 
LSU 1.74 69 
Georgia 1.77 70 
Tennessee 1.83 78 
Kentucky 1.86 73 
Ohio St. 1.89 78 
Texas 1.90 72 
Oklahoma 2.45 69 


Source: Pensacola News Journal (Florida), December 21, 2008. 


11.6 Taking a Second Look at Statistics (How Not to 
Interpret the Sample Correlation Coefficient) 


Of all the “numbers” that statisticians and experimenters routinely compute, the 
correlation coefficient is one of the most frequently misinterpreted. Two errors in 
particular are common. First, there is a tendency to assume, either implicitly or 
explicitly, that a high sample correlation coefficient implies causality. It does not. 
Even if the linear relationship between x and y is perfect—that is, even if r= —1 
or r =+1—we cannot conclude that X causes Y (or that Y causes X). The sample 
correlation coefficient is simply a measure of the strength of a linear relationship. 
Why the xy-relationship exists in the first place is a different question altogether. 
George Bernard Shaw (an unlikely contributor to a mathematics text!)
described elegantly the fallacy of using statistical relationships to infer underlying
causality. Commenting on the “correlations” that exist between lifestyle and health,
he wrote in The Doctor’s Dilemma (163):


It is easy to prove that the wearing of tall hats and the carrying of umbrellas enlarges 
the chest, prolongs life, and confers comparative immunity from disease; for the 
statistics show that the classes which use these articles are bigger, healthier, and live 
longer than the class which never dreams of possessing such things. It does not take 
much perspicacity to see that what really makes this difference is not the tall hat and 
the umbrella, but the wealth and nourishment of which they are evidence, and that 
a gold watch or membership of a club in Pall Mall might be proved in the same way 
to have the like sovereign virtues. A university degree, a daily bath, the owning of 
thirty pairs of trousers, a knowledge of Wagner’s music, a pew in church, anything, in 
short, that implies more means and better nurture than the mass of laborers enjoy, 
can be statistically palmed off as a magic-spell conferring all sorts of privileges. 


Examples of “spurious” correlations similar to those cited by Shaw are dis- 
turbingly commonplace. Between 1875 and 1920, for example, the correlation 
between the annual birthrate in Great Britain and the annual production of pig 
iron in the United States was an almost “perfect” —0.98. High correlations have 
also been found between salaries of Presbyterian ministers in Massachusetts and the 
price of rum in Havana and between the academic achievement of U.S. schoolchil- 
dren and the number of miles they live from the Canadian border. All too often, 
what looks like a cause is not a cause at all, but simply the effect of one or more fac- 
tors that were not even measured. Researchers need to be very careful not to read 
more into the value of r than the number legitimately implies. 

The second error frequently made when interpreting sample correlation coefficients
is to forget that r measures the strength of a linear relationship. It says nothing
about the strength of a curvilinear relationship. Computing r for the points shown
in Figure 11.6.1, for example, is totally inappropriate. The $(x_i, y_i)$ values in that
scatterplot are clearly related but not in a linear way. Quoting the value of r would be
misleading.


[Figure 11.6.1: scatterplot of y versus x in which the points follow a curved, nonlinear pattern]

The lesson to be learned from Figure 11.6.1 is clear—always graph the data! 
No correlation coefficient should ever be calculated (much less interpreted) without 
first plotting the (x;, y;)’s to make certain that the underlying relationship is linear. 
Digital cameras have probably rendered photographs useless as evidence in a court 
of law, but for a statistician, a picture is still worth a thousand words. 
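A small numerical example makes the same point as Figure 11.6.1. In the Python sketch below (hypothetical data, not taken from the figure), y is an exact function of x, namely $y = x^2$, yet the sample correlation coefficient computed from Equation 11.4.2 is zero because the relationship has no linear component over a range of x values that is symmetric about 0.

```python
import math

def sample_correlation(xs, ys):
    """Sample correlation coefficient r (Equation 11.4.2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    )

# Hypothetical data: a perfect, but purely curvilinear, relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = sample_correlation(xs, ys)
print(r)
```

Reporting this value of r without a plot would suggest, quite wrongly, that x and y are unrelated.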


Appendix 11.A.1 Minitab Applications 


If a set of $x_i$’s has been entered in Column C1 and the associated $y_i$’s in Column C2,
the Minitab command


MTB > regress c2 1 c1


will compute the estimated regression line, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, and provide the calculations
for testing $H_0$: $\beta_1 = 0$ and $H_0$: $\beta_0 = 0$. Also printed out automatically will be $r^2$ and s,
the square root of the unbiased estimate for $\sigma^2$ in the simple linear model. Subcommands
are available for plotting the data, calculating and graphing the residuals, and
constructing confidence intervals and prediction intervals.

Figure 11.A.1.1 is the printout of the REGRESS command applied to the Sales
versus Revenue data described in Case Study 11.3.2. Included is a listing of the
residuals (in Column C3).

The entries in the “SE Coef” column are based on parts (c) and (d) of Theorem
11.3.2. The value 0.006677, for example, is the estimated standard deviation of
the estimated slope. That is,

$$0.006677 = \frac{s}{\sqrt{\sum_{i=1}^{9} (x_i - \bar{x})^2}}$$

where s = 50.3489 (as listed on the printout). The last entry in the “T” column is the
value of $T_{n-2}$ from Theorem 11.3.4 when $\beta_1 = 0$. That is,

$$61.93 = \frac{0.413520 - 0}{0.006677}$$

As we have seen in earlier chapters, the “conclusions” of hypothesis tests performed
by computer software packages are invariably couched in terms of P-values.
Here, for example, the test of $H_0$: $\beta_0 = 0$ versus $H_1$: $\beta_0 \ne 0$ yields an observed t ratio
of 0.52, for which the P-value is 0.621. Since the latter is so large, we would fail to
reject $H_0$: $\beta_0 = 0$ at any reasonable level of $\alpha$.


Figure 11.A.1.1

MTB > set c1
DATA > 1687 2178 2649 3289 4076 5294 6369 7787 9411
DATA > end
MTB > set c2
DATA > 748 962 1113 1350 1686 2199 2605 3179 3999
DATA > end
MTB > regress c2 1 c1;
SUBC > residuals c3.


Regression Analysis: C2 versus C1


The regression equation is
C2 = 18.6 + 0.414 C1


Predictor   Coef       SE Coef    T       P
Constant    18.57      35.87      0.52    0.621
C1          0.413520   0.006677   61.93   0.000

S = 50.3489   R-Sq = 99.8%   R-Sq(adj) = 99.8%


MTB > print c1 c2 c3


Data Display


Row   C1     C2     C3
1     1687   748    31.8218
2     2178   962    42.7834
3     2649   1113   -0.9845
4     3289   1350   -28.6373
5     4076   1686   -18.0775
6     5294   2199   -8.7449
7     6369   2605   -47.2789
8     7787   3179   -59.6502
9     9411   3999   88.7933

If SUBC > predict “x” is appended to the “regress c2 1 c1” command, Minitab
will print out the 95% confidence interval for E(Y | x) and the 95% prediction interval
for Y at the point x. Figure 11.A.1.2 shows the input and output that provide
these computations.


Figure 11.A.1.2

MTB > set c1
DATA > 1687 2178 2649 3289 4076 5294 6369 7787 9411
DATA > end
MTB > set c2
DATA > 748 962 1113 1350 1686 2199 2605 3179 3999
DATA > end
MTB > regress c2 1 c1;
SUBC > predict 9700.


Predicted Values for New Observations


New
Obs   Fit      SE Fit   95% CI               95% PI
1     4029.7   37.1     (3942.1, 4117.4)     (3881.9, 4177.6)

Doing Linear Regression Using Minitab Windows 


1. Enter the $x_i$’s in C1 and the $y_i$’s in C2.

2. Click on STAT, then on REGRESSION, then on second REGRESSION.

3. Type C2 in RESPONSE box. Then click on PREDICTOR box and type C1.

4. Click on OK.

5. To display the line, click on STAT, then on REGRESSION, then on FITTED
LINE PLOT.

6. Type C2 in RESPONSE box and C1 in PREDICTOR box.

7. Click on LINEAR; then click on OK.


Appendix 11.A.2 A Proof of Theorem 11.3.3 


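The strategy for the proof is to express $n\hat{\sigma}^2$ in terms of the squares of normal random
variables and then apply Fisher’s Lemma (see Appendix 7.A.2). The random
variables to be used are $\hat{\beta}_1 - \beta_1$, $W_i = Y_i - \beta_0 - \beta_1 x_i$, $i = 1, \ldots, n$, and
$\bar{W} = \frac{1}{n}\sum_{i=1}^{n} W_i = \bar{Y} - \beta_0 - \beta_1\bar{x}$. Note that

$$W_i - \bar{W} = (Y_i - \bar{Y}) - \beta_1(x_i - \bar{x})$$

or, equivalently,

$$Y_i - \bar{Y} = (W_i - \bar{W}) + \beta_1(x_i - \bar{x})$$

Next, we express $\hat{\beta}_1 - \beta_1$ as a linear combination of the $W_i$’s. The argument
begins by using Equation 11.3.1 to express $\hat{\beta}_1$:

$$\begin{aligned}
\hat{\beta}_1 - \beta_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} - \beta_1 \\
&= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) - \beta_1 \sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
&= \frac{\sum_{i=1}^{n} (x_i - \bar{x})\left[(W_i - \bar{W}) + \beta_1(x_i - \bar{x})\right] - \beta_1 \sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
&= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(W_i - \bar{W})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned} \tag{11.A.2.1}$$

Recall from Equation 11.3.3 that

$$n\hat{\sigma}^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 - \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{11.A.2.2}$$

We need to express Equation 11.A.2.2 in terms of the $W_i$’s, that is,

$$\begin{aligned}
n\hat{\sigma}^2 &= \sum_{i=1}^{n} \left[(W_i - \bar{W}) + \beta_1(x_i - \bar{x})\right]^2 - \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \sum_{i=1}^{n} (W_i - \bar{W})^2 + 2\beta_1 \sum_{i=1}^{n} (x_i - \bar{x})(W_i - \bar{W}) + \beta_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 - \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2
\end{aligned} \tag{11.A.2.3}$$

From Equation 11.A.2.1, we can write

$$\sum_{i=1}^{n} (x_i - \bar{x})(W_i - \bar{W}) = (\hat{\beta}_1 - \beta_1) \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Substituting the right-hand side of the preceding expression for $\sum_{i=1}^{n} (x_i - \bar{x})(W_i - \bar{W})$
in Equation 11.A.2.3 gives

$$\begin{aligned}
n\hat{\sigma}^2 &= \sum_{i=1}^{n} (W_i - \bar{W})^2 + 2\beta_1(\hat{\beta}_1 - \beta_1) \sum_{i=1}^{n} (x_i - \bar{x})^2 + \beta_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 - \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \sum_{i=1}^{n} (W_i - \bar{W})^2 + \sum_{i=1}^{n} (x_i - \bar{x})^2 \left[ 2\beta_1(\hat{\beta}_1 - \beta_1) + \beta_1^2 - \hat{\beta}_1^2 \right] \\
&= \sum_{i=1}^{n} (W_i - \bar{W})^2 - \sum_{i=1}^{n} (x_i - \bar{x})^2 \left[ \hat{\beta}_1^2 - 2\hat{\beta}_1\beta_1 + \beta_1^2 \right] \\
&= \sum_{i=1}^{n} (W_i - \bar{W})^2 - \sum_{i=1}^{n} (x_i - \bar{x})^2 (\hat{\beta}_1 - \beta_1)^2 \\
&= \sum_{i=1}^{n} W_i^2 - n\bar{W}^2 - \sum_{i=1}^{n} (x_i - \bar{x})^2 (\hat{\beta}_1 - \beta_1)^2
\end{aligned}$$

Now, choose an orthogonal matrix, M, whose first two rows are

$$\left( \frac{x_1 - \bar{x}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \; \ldots, \; \frac{x_n - \bar{x}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \right)$$

and

$$\left( \frac{1}{\sqrt{n}}, \; \ldots, \; \frac{1}{\sqrt{n}} \right)$$

Define the random variables $Z_1, \ldots, Z_n$ through the transformation

$$\begin{pmatrix} Z_1 \\ \vdots \\ Z_n \end{pmatrix} = M \begin{pmatrix} W_1 \\ \vdots \\ W_n \end{pmatrix}$$

By Fisher’s Lemma, the $Z_i$’s are independent, normal random variables with
mean zero and variance $\sigma^2$, and

$$\sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} W_i^2$$

Also, by Equation 11.A.2.1 and the choice of the first row of M,

$$Z_1^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 (\hat{\beta}_1 - \beta_1)^2$$

and, by the selection of the second row of M,

$$Z_2^2 = n\bar{W}^2$$

Thus,

$$n\hat{\sigma}^2 = \sum_{i=1}^{n} W_i^2 - Z_1^2 - Z_2^2 = \sum_{i=3}^{n} Z_i^2$$

From this follows the independence of $n\hat{\sigma}^2$, $\hat{\beta}_1$, and $\bar{Y}$.
Finally, notice that

$$\frac{n\hat{\sigma}^2}{\sigma^2} = \sum_{i=3}^{n} \left( \frac{Z_i}{\sigma} \right)^2$$

The fact that the sum has a chi square distribution with n − 2 degrees of freedom
proves the last part of the theorem.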
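The key algebraic identity in the proof, $n\hat{\sigma}^2 = \sum (W_i - \bar{W})^2 - \sum (x_i - \bar{x})^2 (\hat{\beta}_1 - \beta_1)^2$, can be spot-checked by simulation. The Python sketch below is only an illustration: the "true" parameter values, the x values, and the random seed are arbitrary assumptions, and $n\hat{\sigma}^2$ is computed as the sum of squared residuals from the fitted line.

```python
import random

random.seed(0)

# Hypothetical true parameters and fixed x values for the simulation.
beta0, beta1, sigma = 2.0, 0.5, 1.0
xs = [float(i) for i in range(1, 11)]
ys = [beta0 + beta1 * x + random.gauss(0.0, sigma) for x in xs]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# Least squares (maximum likelihood) estimates of slope and intercept.
b1_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0_hat = ybar - b1_hat * xbar

# Left-hand side: n * sigma_hat^2, the sum of squared residuals.
n_sigma_hat_sq = sum((y - b0_hat - b1_hat * x) ** 2 for x, y in zip(xs, ys))

# Right-hand side, written in terms of W_i = Y_i - beta0 - beta1 * x_i.
ws = [y - beta0 - beta1 * x for x, y in zip(xs, ys)]
wbar = sum(ws) / n
rhs = sum((w - wbar) ** 2 for w in ws) - sxx * (b1_hat - beta1) ** 2

print(n_sigma_hat_sq, rhs)
```

The two printed values should agree up to floating-point rounding, whatever seed or parameter values are used.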


Chapter 12
THE ANALYSIS OF VARIANCE


12.1 Introduction
12.2 The F Test
12.3 Multiple Comparisons: Tukey’s Method
12.4 Testing Subhypotheses with Contrasts
12.5 Data Transformations
12.6 Taking a Second Look at Statistics (Putting the Subject of Statistics
     Together: The Contributions of Ronald A. Fisher)
Appendix 12.A.1 Minitab Applications
Appendix 12.A.2 A Proof of Theorem 12.2.2
Appendix 12.A.3 The Distribution of $\dfrac{SSTR/(k - 1)}{SSE/(n - k)}$ When $H_1$ Is True

“No aphorism is more frequently repeated in connection with field trials, than that we
must ask Nature few questions or, ideally, one question, at a time. The writer is
convinced that this view is wholly mistaken. Nature, he suggests, will best respond to
a logical and carefully thought-out questionnaire; indeed, if we ask her a single
question, she will often refuse to answer until some other topic has been discussed.”
—Ronald A. Fisher 


12.1 Introduction 


In this chapter we take up an important extension of the two-sample location prob- 
lem introduced in Chapter 9. The completely randomized one-factor design is a 
conceptually similar k-sample location problem, but one that requires a substantially 
different sort of analysis than its prototype. Here, the appropriate test statistic turns 
out to be a ratio of variance estimates, the sampling behavior of which is described 
by an F distribution rather than a Student t. The name attached to this procedure, 
in deference to the form of its test statistic, is the analysis of variance (or ANOVA 
for short). A very flexible method, the analysis of variance is applied to many other 
experimental designs as well, a particularly important one being the randomized 
block design covered in Chapter 13. 


Comment Credit for much of the early development of the analysis of variance 
goes to Sir Ronald A. Fisher. Shortly after the end of World War I, Fisher resigned 
a public school teaching position that he was none too happy with and accepted a 
post at the Rothamsted Agricultural Experiment Station, a facility heavily involved 
in agricultural research. There he found himself entangled in problems where differ- 
ences in the response variable (crop yields, for example) were constantly in danger 
of being obscured by the high level of uncontrollable heterogeneity in the experi- 
mental environment (different soil qualities, drainage gradients, and so on). Quickly 
seeing that traditional techniques were hopelessly inadequate under these conditions,
Fisher set out to look for alternatives and in just a few years succeeded
in fashioning an entirely new statistical methodology, a panoply of data-collecting 
principles and mathematical tools that is today known as experimental design. The 
centerpiece of Fisher’s creation—what makes it all work —is the analysis of variance. 


Suppose an experimenter wishes to compare the average effects elicited by k 
different levels of some given factor, where k is greater than or equal to 2. The 
factor, for example, might be “stop-smoking” therapies and the levels, three spe- 
cific methods. Or the factor might be crowdedness as it relates to aggression in 
captive monkeys, with the levels being five different monkey-per-square-foot den- 
sities in five separate enclosures. Still another example might be an engineering 
study comparing the effectiveness of four kinds of catalytic converters in reducing 
the concentrations of harmful emissions in automobile exhaust. Whatever the cir- 
cumstances, data from a completely randomized one-factor design will consist of k 
independent random samples of sizes $n_1, n_2, \ldots,$ and $n_k$, the total sample size being
denoted $n$ $\left(=\sum_{j=1}^{k} n_j\right)$. We will let $Y_{ij}$ represent the $i$th observation recorded for the
$j$th level. Table 12.1.1 shows some additional terminology. (Note: To simplify notation
in the next two chapters, data will always be written as random variables, that
is, as $Y_{ij}$ rather than $y_{ij}$.)

The dot notation of Table 12.1.1 is standard in analysis of variance problems. 
The presence of a dot in lieu of a subscript indicates that particular subscript has 
been summed over. Thus the response total for the jth sample is written 


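\[
T_{.j}=\sum_{i=1}^{n_j}Y_{ij}\quad\left(=Y_{1j}+Y_{2j}+\cdots+Y_{n_j j}\right)
\]

and the corresponding sample mean becomes $\bar{Y}_{.j}$, where

\[
\bar{Y}_{.j}=\frac{1}{n_j}\sum_{i=1}^{n_j}Y_{ij}=\frac{T_{.j}}{n_j}
\]

Table 12.1.1

                      Treatment Level
                  1              2              ...    k
                  $Y_{11}$       $Y_{12}$       ...    $Y_{1k}$
                  $Y_{21}$       $Y_{22}$       ...    $Y_{2k}$
                  ...            ...                   ...
                  $Y_{n_1 1}$    $Y_{n_2 2}$    ...    $Y_{n_k k}$
Sample sizes:     $n_1$          $n_2$          ...    $n_k$
Sample totals:    $T_{.1}$       $T_{.2}$       ...    $T_{.k}$
Sample means:     $\bar{Y}_{.1}$ $\bar{Y}_{.2}$ ...    $\bar{Y}_{.k}$
True means:       $\mu_1$        $\mu_2$        ...    $\mu_k$

By the same convention, $T_{..}$ and $\bar{Y}_{..}$ will denote the overall total and overall mean,
respectively:

\[
T_{..}=\sum_{j=1}^{k}\sum_{i=1}^{n_j}Y_{ij}=\sum_{j=1}^{k}T_{.j}
\qquad
\bar{Y}_{..}=\frac{1}{n}\sum_{j=1}^{k}\sum_{i=1}^{n_j}Y_{ij}=\frac{1}{n}\sum_{j=1}^{k}n_j\bar{Y}_{.j}=\frac{T_{..}}{n}
\]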

Appearing at the bottom of Table 12.1.1 are a set of true means, $\mu_1, \mu_2, \ldots, \mu_k$.
Each $\mu_j$ is an unknown location parameter reflecting the true average response
characteristic of level $j$. Often our objective will be to test the equality of the
$\mu_j$'s, that is,

\[
H_0: \mu_1=\mu_2=\cdots=\mu_k
\]

versus

\[
H_1: \text{not all the } \mu_j\text{'s are equal}
\]

In the next several sections we will propose a variance-ratio statistic for testing
$H_0$, investigate its sampling behavior under both $H_0$ and $H_1$, and introduce a set of
computing formulas to simplify its evaluation. We will also explore the possibility of
testing subhypotheses about the $\mu_j$'s, for example, $H_0: \mu_i=\mu_j$ (irrespective of the
other $\mu_j$'s) or $H_0: \mu_3=(\mu_4+\mu_5)/2$.
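The dot notation can be made concrete with a few lines of code. The following is a minimal sketch in Python, not part of the original text; the three samples and all variable names are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
# Hypothetical data: k = 3 treatment levels with unequal sample sizes.
samples = [
    [4.0, 6.0, 5.0],        # level 1, n_1 = 3
    [7.0, 9.0],             # level 2, n_2 = 2
    [1.0, 2.0, 3.0, 2.0],   # level 3, n_3 = 4
]

sizes = [len(s) for s in samples]                     # n_j
totals = [sum(s) for s in samples]                    # T_.j (sum over i)
means = [t / n_j for t, n_j in zip(totals, sizes)]    # Ybar_.j

n = sum(sizes)                                        # total sample size
grand_total = sum(totals)                             # T_.. (sum over i and j)
grand_mean = grand_total / n                          # Ybar_..

# The overall mean is the size-weighted average of the sample means.
weighted = sum(n_j * m for n_j, m in zip(sizes, means)) / n

print("sizes :", sizes)
print("totals:", totals)
print("means :", means)
print("n, T.., Ybar..:", n, grand_total, grand_mean)
```

Each dot marks a subscript that has been summed over, which is why the overall total can be obtained either from the raw observations or from the k sample totals.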


12.2 The F Test 


To derive a procedure for testing $H_0: \mu_1=\mu_2=\cdots=\mu_k$, we could once again invoke
the generalized likelihood ratio criterion, compute $\lambda=L(\omega_e)/L(\Omega_e)$, and begin the
search for a monotonic function of $\lambda$ having a known distribution. But since we have
already seen several examples of formal GLRT calculations in Chapters 7 and 9, the
benefits of doing another would be marginal. Deducing the test statistic on intuitive
grounds will be more instructive.

The data structure for a completely randomized one-factor design was outlined
in Section 12.1. To that basic setup we now add a distribution assumption: The $Y_{ij}$'s
will be presumed to be independent and normally distributed with mean $\mu_j$, $j=1, 2, \ldots, k$,
and variance $\sigma^2$ (constant for all $j$), that is,

\[
f_{Y_{ij}}(y)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2}\left(\frac{y-\mu_j}{\sigma}\right)^2},\qquad -\infty<y<\infty
\]

In analysis of variance problems—as was true in regression problems— 
distribution assumptions are usually expressed in terms of model equations. In the 
latter, the response variable is represented as the sum of one or more fixed com- 
ponents and one or more random components. Here, one possible model equation 
would be 


\[
Y_{ij}=\mu_j+\varepsilon_{ij}
\]

where $\varepsilon_{ij}$ denotes the “noise” associated with $Y_{ij}$, that is, the amount by which $Y_{ij}$
differs from its expected value. Of course, from the distribution assumption on $Y_{ij}$,
it follows that $\varepsilon_{ij}$ is also normal with variance $\sigma^2$, but with mean zero.


We will denote the overall average effect associated with the $n$ observations in
the sample by the symbol $\mu$, where $\mu=\frac{1}{n}\sum_{j=1}^{k}n_j\mu_j$. If $H_0$ is true, of course, $\mu$ is the
value that each of the $\mu_j$'s equals.


Sums of Squares 


To find an appropriate test statistic, we begin by estimating each of the $\mu_j$'s. For
each $j$, $Y_{1j}, Y_{2j}, \ldots, Y_{n_j j}$ is a random sample from a normal distribution. By Example
5.2.4, the maximum likelihood estimator of $\mu_j$ is $\bar{Y}_{.j}$. Then $\frac{1}{n}\sum_{j=1}^{k}n_j\bar{Y}_{.j}=\bar{Y}_{..}$ is the
obvious choice to estimate $\mu$. It follows that

\[
SSTR=\sum_{j=1}^{k}\sum_{i=1}^{n_j}(\bar{Y}_{.j}-\bar{Y}_{..})^2=\sum_{j=1}^{k}n_j(\bar{Y}_{.j}-\bar{Y}_{..})^2
\]

which is called the treatment sum of squares, estimates the variation among the $\mu_j$'s.
[If all the $\mu_j$'s were equal, the $\bar{Y}_{.j}$'s would be similar (to $\bar{Y}_{..}$) and SSTR would be
small.]

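Analyzing the behavior of SSTR requires an expression relating the $\bar{Y}_{.j}$'s and
$\bar{Y}_{..}$ to the parameter $\mu$. But

\[
\begin{aligned}
SSTR &= \sum_{j=1}^{k}n_j(\bar{Y}_{.j}-\bar{Y}_{..})^2=\sum_{j=1}^{k}n_j\left[(\bar{Y}_{.j}-\mu)-(\bar{Y}_{..}-\mu)\right]^2 \\
&= \sum_{j=1}^{k}n_j\left[(\bar{Y}_{.j}-\mu)^2+(\bar{Y}_{..}-\mu)^2-2(\bar{Y}_{.j}-\mu)(\bar{Y}_{..}-\mu)\right] \\
&= \sum_{j=1}^{k}n_j(\bar{Y}_{.j}-\mu)^2+n(\bar{Y}_{..}-\mu)^2-2(\bar{Y}_{..}-\mu)\sum_{j=1}^{k}n_j(\bar{Y}_{.j}-\mu) \\
&= \sum_{j=1}^{k}n_j(\bar{Y}_{.j}-\mu)^2-n(\bar{Y}_{..}-\mu)^2
\end{aligned}
\tag{12.2.1}
\]

where the last step uses $\sum_{j=1}^{k}n_j(\bar{Y}_{.j}-\mu)=n(\bar{Y}_{..}-\mu)$.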


Now, with Equation 12.2.1 as background, Theorem 12.2.1 states the connection we
are looking for: that the expected value of SSTR increases as the differences among
the $\mu_j$'s increase.


Theorem 12.2.1. Let SSTR be the treatment sum of squares defined for $k$ independent random samples
of sizes $n_1, n_2, \ldots,$ and $n_k$. Then

\[
E(SSTR)=(k-1)\sigma^2+\sum_{j=1}^{k}n_j(\mu_j-\mu)^2
\]


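Proof From Equation 12.2.1,

\[
E(SSTR)=\sum_{j=1}^{k}n_jE\left[(\bar{Y}_{.j}-\mu)^2\right]-nE\left[(\bar{Y}_{..}-\mu)^2\right]
\]

Since $\mu$ is the mean of $\bar{Y}_{..}$, then $E\left[(\bar{Y}_{..}-\mu)^2\right]=\sigma^2/n$. Also,

\[
E\left[(\bar{Y}_{.j}-\mu)^2\right]=\operatorname{Var}(\bar{Y}_{.j}-\mu)+\left[E(\bar{Y}_{.j}-\mu)\right]^2
\]

by Theorem 3.6.1. But Theorem 3.6.2 implies that

\[
\operatorname{Var}(\bar{Y}_{.j}-\mu)=\operatorname{Var}(\bar{Y}_{.j})=\sigma^2/n_j
\]

So, $E\left[(\bar{Y}_{.j}-\mu)^2\right]=\sigma^2/n_j+(\mu_j-\mu)^2$. Substituting these equalities into the expression
for $E(SSTR)$ yields

\[
E(SSTR)=\sum_{j=1}^{k}n_j\sigma^2/n_j+\sum_{j=1}^{k}n_j(\mu_j-\mu)^2-n(\sigma^2/n)
\]

or

\[
E(SSTR)=(k-1)\sigma^2+\sum_{j=1}^{k}n_j(\mu_j-\mu)^2
\]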
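The formula for $E(SSTR)$ can be checked empirically. The following is a rough Monte Carlo sketch in Python, not taken from the text: the sample sizes, true means, value of $\sigma$, seed, and number of replications are all hypothetical choices made for illustration, and the simulated average should only be expected to land near the theoretical value, not on it.

```python
import random

def sstr(samples):
    """Treatment sum of squares: sum over j of n_j * (Ybar_.j - Ybar_..)^2."""
    n = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n
    return sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in samples)

# Hypothetical design, chosen only for illustration.
sizes = [4, 5, 6]            # n_1, n_2, n_3
mus = [10.0, 12.0, 15.0]     # true means mu_1, mu_2, mu_3
sigma = 2.0                  # common standard deviation

n = sum(sizes)
k = len(sizes)
mu = sum(n_j * m for n_j, m in zip(sizes, mus)) / n   # weighted overall mean

# Theoretical value: (k - 1) * sigma^2 + sum of n_j * (mu_j - mu)^2.
expected = (k - 1) * sigma ** 2 + sum(
    n_j * (m - mu) ** 2 for n_j, m in zip(sizes, mus)
)

rng = random.Random(12345)   # fixed seed so the run is repeatable
reps = 20000
total = 0.0
for _ in range(reps):
    data = [[rng.gauss(m, sigma) for _ in range(n_j)]
            for n_j, m in zip(sizes, mus)]
    total += sstr(data)
simulated = total / reps

print("theoretical E(SSTR):", round(expected, 3))
print("simulated average  :", round(simulated, 3))
```

Setting all three entries of `mus` to the same value makes the second term vanish, so the simulated average should then settle near $(k-1)\sigma^2 = 8$.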

Testing $H_0: \mu_1=\mu_2=\cdots=\mu_k$ When $\sigma^2$ Is Known


Theorem 12.2.1 suggests that SSTR can be the basis for a test of the null hypothesis
that the treatment level means are all equal. When the $\mu_j$'s are the same, $E(SSTR)=(k-1)\sigma^2$.
If the true means are not all equal, $E(SSTR)$ will be larger than $(k-1)\sigma^2$.
It follows that we should reject $H_0$ if SSTR is “significantly large.” Of course, to
determine the exact location of the rejection region for a given $\alpha$, we need to know
the pdf of SSTR, or some function of SSTR, when $H_0$ is true.


Theorem 12.2.2. When $H_0: \mu_1=\mu_2=\cdots=\mu_k$ is true, $SSTR/\sigma^2$ has a chi square distribution with $k-1$
degrees of freedom.


Proof The theorem can be proved directly at this point by an application of Fisher’s 
Lemma, similar to the approaches taken in Appendices 7.A.2 and 11.A.2. Rather 
than repeat those arguments, we will give a moment-generating function derivation 
in Appendix 12.A.2. 


If $\alpha$, then, is the level of significance, and if $\sigma^2$ is known, we should reject
$H_0: \mu_1=\mu_2=\cdots=\mu_k$ in favor of $H_1$: not all the $\mu_j$'s are equal if $SSTR/\sigma^2\geq\chi^2_{1-\alpha,k-1}$.
In practice, though, comparing a set of $\mu_j$'s is seldom that easy because $\sigma^2$ is rarely
known. Almost invariably, $\sigma^2$ needs to be estimated; doing so changes both the
nature and the distribution of the test statistic.

Testing $H_0: \mu_1=\mu_2=\cdots=\mu_k$ When $\sigma^2$ Is Unknown


We know that each of the $k$ samples can provide an independent, unbiased estimate
for $\sigma^2$ (recall Example 5.4.4 and see the following discussion). Using the notation of
Table 12.1.1, the $j$th sample variance is written

\[
S_j^2=\frac{1}{n_j-1}\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{.j})^2
\]

Multiplying each $S_j^2$ by $n_j-1$ and summing over $j$ gives the numerator of the obvious
“pooled” estimator for $\sigma^2$ (recall the way $S_p^2$ was defined in the two-sample $t$
test). We call this quantity the error sum of squares, or SSE:

\[
SSE=\sum_{j=1}^{k}(n_j-1)S_j^2=\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{.j})^2
\]

Theorem 12.2.3. Whether or not $H_0: \mu_1=\mu_2=\cdots=\mu_k$ is true,

1. $SSE/\sigma^2$ has a chi square distribution with $n-k$ degrees of freedom.
2. SSE and SSTR are independent.


Proof By Theorem 7.3.2, $(n_j-1)S_j^2/\sigma^2$ has a chi square distribution with $n_j-1$
degrees of freedom. By the addition property, then, of the chi square distribution,
$SSE/\sigma^2$ is a chi square random variable with $\sum_{j=1}^{k}(n_j-1)=n-k$ degrees of freedom.
Each $S_j^2$ is independent of $\bar{Y}_{.i}$ for $i\neq j$ because the underlying samples are independent.
Also, each $S_j^2$ is independent of $\bar{Y}_{.j}$ by Theorem 7.3.2. Therefore, SSE and
SSTR are independent.


If we ignore the treatments and consider the data as one sample of size $n$, then the variation
about the parameter $\mu$ can be estimated by the double sum $\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{..})^2$.
This quantity is known as the total sum of squares and denoted SSTOT.


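Theorem 12.2.4. If $n$ observations are divided into $k$ samples of sizes $n_1, n_2, \ldots,$ and $n_k$,

\[
SSTOT=SSTR+SSE
\]

Proof

\[
SSTOT=\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{..})^2=\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left[(Y_{ij}-\bar{Y}_{.j})+(\bar{Y}_{.j}-\bar{Y}_{..})\right]^2 \tag{12.2.2}
\]

Expanding the right-hand side of Equation 12.2.2 gives

\[
\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{.j})^2+\sum_{j=1}^{k}\sum_{i=1}^{n_j}(\bar{Y}_{.j}-\bar{Y}_{..})^2
\]

since the cross-product term vanishes:

\[
2\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{.j})(\bar{Y}_{.j}-\bar{Y}_{..})=2\sum_{j=1}^{k}(\bar{Y}_{.j}-\bar{Y}_{..})\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{.j})=2\sum_{j=1}^{k}(\bar{Y}_{.j}-\bar{Y}_{..})(0)=0
\]

Therefore,

\[
\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{..})^2=\sum_{j=1}^{k}\sum_{i=1}^{n_j}(\bar{Y}_{.j}-\bar{Y}_{..})^2+\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{.j})^2
\]

That is, SSTOT = SSTR + SSE.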
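The identity is easy to confirm numerically. The sketch below, written in Python and not part of the original text, evaluates the three sums of squares from their defining formulas on a small hypothetical data set and also computes the cross-product term, which should be zero up to rounding error.

```python
# Hypothetical data: k = 3 samples of unequal size.
samples = [
    [4.0, 6.0, 5.0],
    [7.0, 9.0],
    [1.0, 2.0, 3.0, 2.0],
]

n = sum(len(s) for s in samples)
grand_mean = sum(sum(s) for s in samples) / n          # Ybar_..
means = [sum(s) / len(s) for s in samples]             # Ybar_.j

# Defining formulas for the three sums of squares.
sstr = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
sse = sum((y - m) ** 2 for s, m in zip(samples, means) for y in s)
sstot = sum((y - grand_mean) ** 2 for s in samples for y in s)

# Cross-product term from the expansion of Equation 12.2.2.
cross = sum((y - m) * (m - grand_mean)
            for s, m in zip(samples, means) for y in s)

print("SSTR =", sstr, " SSE =", sse, " SSTOT =", sstot)
print("SSTR + SSE =", sstr + sse, " cross-product term =", cross)
```

The cross term is zero because, within each sample, the deviations from that sample's own mean sum to zero.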


Theorem 12.2.5. Suppose that each observation in a set of $k$ independent random samples is normally
distributed with the same variance, $\sigma^2$. Let $\mu_1, \mu_2, \ldots,$ and $\mu_k$ be the true means
associated with the $k$ samples. Then

a. If $H_0: \mu_1=\mu_2=\cdots=\mu_k$ is true,

\[
F=\frac{SSTR/(k-1)}{SSE/(n-k)}
\]

has an F distribution with $k-1$ and $n-k$ degrees of freedom.
b. At the $\alpha$ level of significance, $H_0: \mu_1=\mu_2=\cdots=\mu_k$ should be rejected if $F\geq F_{1-\alpha,k-1,n-k}$.


Proof By Theorem 12.2.3, SSTR and SSE are independent. We also know that
$SSTR/\sigma^2$ and $SSE/\sigma^2$ are chi square random variables. Part (a), then, follows from
the definition of the F distribution.

To justify the location of the critical region cited in part (b), we need to examine
the behavior of the proposed test statistic when $H_1$ is true. From Theorem 12.2.1, we
know the expected value of the numerator of F:

\[
E[SSTR/(k-1)]=\sigma^2+\frac{1}{k-1}\sum_{j=1}^{k}n_j(\mu_j-\mu)^2 \tag{12.2.3}
\]

Moreover, from Theorem 12.2.3 it follows that the expected value of the denominator
of the test statistic, that is, $E[SSE/(n-k)]$, is $\sigma^2$, regardless of which hypothesis
is true.

Now, if $H_0$ is true, the expected values of both the numerator and the denominator
of F will be $\sigma^2$, so the ratio is likely to be close to 1. If $H_1$ is true, though,
the expected value of $SSTR/(k-1)$ will be greater than the expected value of
$SSE/(n-k)$, implying that the observed F ratio will tend to be larger than 1. The
critical region, therefore, should be in the right-hand tail of the $F_{k-1,n-k}$ distribution.
That is, we should reject $H_0: \mu_1=\mu_2=\cdots=\mu_k$ if

\[
F=\frac{SSTR/(k-1)}{SSE/(n-k)}\geq F_{1-\alpha,k-1,n-k}
\]


ANOVA Tables 


Computations for carrying out analyses of variance are typically presented in the 
form of ANOVA tables. Highly structured, these tables are especially helpful in 
identifying the various test statistics that arise in connection with complicated experimental
designs. Figure 12.2.1 shows the format of the ANOVA table for testing
$H_0: \mu_1=\mu_2=\cdots=\mu_k$.

The rows in any ANOVA table correspond to the sources of variation singled 
out in an observation’s model equation. More specifically, the last row always refers 
to the data’s total variation (as measured by SSTOT); the preceding rows corre- 
spond to the variations whose sum yields the total variation. For this particular 
experimental design, the three rows are reflecting the fact that 


SSTR + SSE = SSTOT 


Source       df       SS       MS      F           P
Treatment    k - 1    SSTR     MSTR    MSTR/MSE    $P(F_{k-1,n-k}\geq \text{observed } F)$
Error        n - k    SSE      MSE
Total        n - 1    SSTOT

Figure 12.2.1


Next to each “source” is the number of degrees of freedom (df) associated with 
its sum of squares. Note that the df for total is the sum of the degrees of freedom for 
treatments and error (n—1=k—1+n-—k). 

The SS column lists the sum of squares associated with each source of 
variation—here, either SSTR, SSE, or SSTOT. The MS, or mean square, column is 
derived by dividing each sum of squares by its degrees of freedom. The mean square 
for treatments, then, is given by 


\[
MSTR=\frac{SSTR}{k-1}
\]

and the mean square for error becomes

\[
MSE=\frac{SSE}{n-k}
\]

No entry is listed as being the mean square for total.
The entry in the top row of the F column is the value of the test statistic:

\[
F=\frac{MSTR}{MSE}=\frac{SSTR/(k-1)}{SSE/(n-k)}
\]

The final entry, also in the top row, is the P-value associated with the observed F. If
$P<\alpha$, of course, we can reject $H_0: \mu_1=\mu_2=\cdots=\mu_k$ at the $\alpha$ level of significance.


Case Study 12.2.1 


Generations of athletes have been cautioned that cigarette smoking retards performance.
One measure of the truth of that warning is the effect of smoking
on heart rate. In one study (73) examining that impact, six each of nonsmokers,
light smokers, moderate smokers, and heavy smokers undertook sustained
physical exercise. Their heart rates were measured after resting for three
minutes. The results appear in Table 12.2.1. Are the differences among the $\bar{Y}_{.j}$'s
statistically significant? That is, if $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$ denote the true average
heart rates for the four groups of smokers, can we reject $H_0: \mu_1=\mu_2=\mu_3=\mu_4$?


Table 12.2.1

                 Nonsmokers   Light Smokers   Moderate Smokers   Heavy Smokers
                 69           55              66                 91
                 52           60              81                 72
                 71           78              70                 81
                 58           58              77                 67
                 59           62              57                 95
                 65           66              79                 84
$T_{.j}$         374          379             430                490
$\bar{Y}_{.j}$   62.3         63.2            71.7               81.7


Let $\alpha=0.05$. For these data, $k=4$ and $n=24$, so $H_0: \mu_1=\mu_2=\mu_3=\mu_4$ should
be rejected if

\[
F=\frac{SSTR/(4-1)}{SSE/(24-4)}\geq F_{1-0.05,4-1,24-4}=F_{.95,3,20}=3.10
\]

(see Figure 12.2.2).

[Figure 12.2.2: density of the $F_{3,20}$ distribution; the area to the right of 3.10 equals 0.05 and marks the region where $H_0$ is rejected.]

The overall sample mean, $\bar{Y}_{..}$, is given by

\[
\bar{Y}_{..}=\frac{1}{n}\sum_{j=1}^{k}T_{.j}=\frac{374+379+430+490}{24}=69.7
\]

Therefore,

\[
SSTR=\sum_{j=1}^{4}n_j(\bar{Y}_{.j}-\bar{Y}_{..})^2=6\left[(62.3-69.7)^2+\cdots+(81.7-69.7)^2\right]=1464.125
\]

Similarly,

\[
\begin{aligned}
SSE &= \sum_{j=1}^{4}\sum_{i=1}^{6}(Y_{ij}-\bar{Y}_{.j})^2 \\
&= \left[(69-62.3)^2+\cdots+(65-62.3)^2\right]+\cdots+\left[(91-81.7)^2+\cdots+(84-81.7)^2\right] \\
&= 1594.833
\end{aligned}
\]

The observed test statistic, then, equals 6.12:

\[
F=\frac{1464.125/(4-1)}{1594.833/(24-4)}=6.12
\]

Since $6.12>F_{.95,3,20}=3.10$, $H_0: \mu_1=\mu_2=\mu_3=\mu_4$ should be rejected. These data
support the contention that smoking influences a person's heart rate.
Figure 12.2.3 shows the analysis of these data summarized in the ANOVA
table format. Notice that the small P-value (= 0.004) is consistent with the
conclusion that $H_0$ should be rejected.


Source       df    SS          MS        F       P
Treatment    3     1464.125    488.04    6.12    0.004
Error        20    1594.833    79.74
Total        23    3058.958

Figure 12.2.3
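The arithmetic of Case Study 12.2.1 can be reproduced from the defining formulas. The following Python sketch is not part of the original text. The heavy-smoker readings are an assumption: they are taken to be 91, 72, 81, 67, 95, and 84, a set of values that agrees with the column total of 490, the sample mean of 81.7, and the sum of squares 119,681 quoted for these data. The critical value 3.10 is the one quoted in the case study rather than something computed here.

```python
# Heart rates from Table 12.2.1 (the heavy-smoker column is an assumption,
# see the paragraph above).
groups = {
    "Nonsmokers":       [69, 52, 71, 58, 59, 65],
    "Light smokers":    [55, 60, 78, 58, 62, 66],
    "Moderate smokers": [66, 81, 70, 77, 57, 79],
    "Heavy smokers":    [91, 72, 81, 67, 95, 84],
}

k = len(groups)
n = sum(len(g) for g in groups.values())
grand_mean = sum(sum(g) for g in groups.values()) / n

# Defining formulas for the treatment and error sums of squares.
sstr = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
           for g in groups.values())
sse = sum((y - sum(g) / len(g)) ** 2 for g in groups.values() for y in g)

mstr = sstr / (k - 1)        # mean square for treatments
mse = sse / (n - k)          # mean square for error
f_ratio = mstr / mse

critical_value = 3.10        # F_{.95,3,20}, as quoted in the case study
reject_h0 = f_ratio >= critical_value

print(f"SSTR = {sstr:.3f}, SSE = {sse:.3f}")
print(f"MSTR = {mstr:.2f}, MSE = {mse:.2f}, F = {f_ratio:.2f}")
print("Reject H0 at the 0.05 level:", reject_h0)
```

Working with unrounded sample means, as the code does, gives the same sums of squares as the ANOVA table in Figure 12.2.3.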


Computing Formulas 


There are easier ways to compute an F statistic than by using the “defining”
formulas for SSTR and SSE. Let $C=T_{..}^2/n$. Then

\[
SSTOT=\sum_{j=1}^{k}\sum_{i=1}^{n_j}Y_{ij}^2-C \tag{12.2.4}
\]

\[
SSTR=\sum_{j=1}^{k}\frac{T_{.j}^2}{n_j}-C \tag{12.2.5}
\]

and, from Theorem 12.2.4,

\[
SSE=SSTOT-SSTR
\]

(The proofs of Equations 12.2.4 and 12.2.5 are left as exercises.)


Example 12.2.1. For the data in Table 12.2.1,

\[
C=T_{..}^2/n=(374+379+430+490)^2/24=116{,}622.04
\]

and

\[
\sum_{j=1}^{4}\sum_{i=1}^{6}Y_{ij}^2=(69)^2+(52)^2+\cdots+(84)^2=119{,}681
\]

in which case

\[
SSTOT=\sum_{j=1}^{4}\sum_{i=1}^{6}Y_{ij}^2-C=119{,}681-116{,}622.04=3058.96
\]

Also,

\[
SSTR=\sum_{j=1}^{4}\frac{T_{.j}^2}{n_j}-C=(374)^2/6+(379)^2/6+(430)^2/6+(490)^2/6-116{,}622.04=1464.13
\]

so

\[
SSE=SSTOT-SSTR=3058.96-1464.13=1594.83
\]

Notice that these sums of squares have the same numerical values that were found
earlier in Case Study 12.2.1 using the original formulas for SSTOT, SSTR, and SSE.
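The computing formulas need only the sample totals, the sample sizes, and the sum of the squared observations. The Python sketch below, which is not part of the original text, applies Equations 12.2.4 and 12.2.5 to the summary figures quoted above; the variable names are illustrative.

```python
# Summary figures quoted for the data of Table 12.2.1.
totals = [374, 379, 430, 490]     # T_.j for the four groups
sizes = [6, 6, 6, 6]              # n_j
sum_of_squares = 119681           # sum over all observations of Y_ij^2

n = sum(sizes)
grand_total = sum(totals)         # T_..

c = grand_total ** 2 / n                                     # C = T_..^2 / n
sstot = sum_of_squares - c                                   # Equation 12.2.4
sstr = sum(t ** 2 / n_j for t, n_j in zip(totals, sizes)) - c  # Equation 12.2.5
sse = sstot - sstr                                           # Theorem 12.2.4

print(f"C     = {c:.2f}")
print(f"SSTOT = {sstot:.3f}")
print(f"SSTR  = {sstr:.3f}")
print(f"SSE   = {sse:.3f}")
```

Because no deviations from a rounded mean are involved, this route avoids the rounding error that can creep into the defining formulas during hand calculation.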


Questions 


12.2.1. The following are the gas mileages recorded during
a series of road tests with four new models of Japanese
luxury sedans. Test the null hypothesis that all four models,
on the average, give the same mileage. Let $\alpha=0.05$.
Will the conclusion change if $\alpha=0.10$?


Model 
A B C D
22 28 29 23 
26 24 32 24 
29 28 


12.2.2. Mount Etna erupted in 1669, 1780, and 1865. 
When molten lava hardens, it retains the direction of the 
Earth’s magnetic field. Three blocks of lava were exam- 
ined from each of these eruptions and the declination of 
the magnetic field in the block was measured (170). The 
results are given in the following table. Do these data 
suggest that the direction of the Earth’s magnetic field 
shifted over the time period spanned by the eruptions? Let 
a =0.05. 


1669 1780 1865 
57.8 57.9 52.7 
60.2 55.2 53.0 
60.3 54.8 49.4 


12.2.3. An indicator of the value of a stock relative to its 
earnings is its price-earnings ratio: the average of a given 
year’s high and low selling prices divided by its annual 


earnings. The following table provides the price-earnings 
ratios for a sample of thirty stocks, ten each from the 
financial, industrial, and utility sectors of the New York 
Stock Exchange. Test at the 0.01 level that the true mean 
price-earnings ratios for the three market sectors are the 
same. Use the computing formulas on p. 604 to find SSTR 
and SSE. Use the ANOVA table format to summarize the 
computations; omit the P-value column. 


Financial Industrial Utility 
71 26.2 14.0 
9.9 12.4 15.5 
8.8 15.2 11.9 
8.8 28.6 10.9 

20.6 10.3 14.3 
7.9 9.7 11.0 
18.8 12.5 9.7 
17.7 16.7 10.8 
15.2 19.7 16.0 
6.6 24.8 11.3 


12.2.4. Each of five varieties of corn are planted in three 
plots in a large field. The respective yields, in bushels per 
acre, are in the following table. 


Variety 1 Variety2 Variety3 Variety4 Variety 5 


46.2 49.2 60.3 48.9 52.5 
51.9 58.6 58.7 51.4 54.0 
48.7 57.4 60.4 44.6 49.3 




Test whether the differences among the average yields are 
statistically significant. Show the ANOVA table. Let 0.05 
be the level of significance. 


12.2.5. Three pottery shards from four widely scattered
and now-extinct Native American tribes have been collected
by a museum. Archaeologists were asked to estimate
the age of the shards. Based on the results shown in
the following table, is it conceivable that the four tribes
were contemporaries of one another? Let $\alpha=0.01$.


Estimated Ages of Shards (years)

Lakeside    Deep Gorge    Willow Ridge    Azalea Hill
1200        850           1800            950
800         900           1450            1200
950         1100          1150            1150


12.2.6. Recall the teachers' expectation data described in
Question 8.2.7. Let $\mu_j$ denote the true average IQ change
associated with group $j$, $j$ = I, II, or III. Test $H_0: \mu_{I}=\mu_{II}=\mu_{III}$
versus $H_1$: not all $\mu_j$'s are equal. Let $\alpha=0.05$.


12.2.7. Fill in the entries missing from the following
ANOVA table.


Source       df      SS        MS       F
Treatment    4       ____      ____     6.40
Error        ____    ____      10.60
Total        ____    377.36


12.2.8. Do the following data appear to violate the
assumptions underlying the analysis of variance? Explain.


        Treatment
A     B     C     D
16    4     26    8
17    12    22    9
16    2     23    11
17    26    24    8


12.2.9. Prove Equations 12.2.4 and 12.2.5.
12.2.10. Use Fisher's Lemma to prove Theorem 12.2.2.


Comparing the Two-Sample t Test with the Analysis of Variance 


The analysis of variance was introduced in Section 12.1 as a $k$-sample extension of
the two-sample $t$ test. The two procedures overlap, though, when $k$ is equal to 2. An
obvious question arises: Which procedure is better for testing $H_0: \mu_X=\mu_Y$? The
answer, as Example 12.2.2 shows, is “neither.” The two test procedures are entirely
equivalent: If one rejects $H_0$, so will the other.


Example 12.2.2. Suppose that $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ are two sets of independent, normally
distributed random variables with the same variance, $\sigma^2$. Let $\mu_X$ and $\mu_Y$ denote their
respective means. Show that the two-sample $t$ test and the analysis of variance are
equivalent for testing $H_0: \mu_X=\mu_Y$.

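If $H_0$ were tested using the analysis of variance, the observed F ratio would be

\[
F=\frac{SSTR/(k-1)}{SSE/(n+m-k)}=\frac{SSTR}{SSE/(n+m-2)} \tag{12.2.6}
\]

and it would have 1 and $n+m-2$ degrees of freedom. The null hypothesis would
be rejected if $F\geq F_{1-\alpha,1,n+m-2}$.

To compare the ANOVA decision rule with a two-sample $t$ test requires that
SSTR and SSE be expressed in the “X and Y” notation of $t$ ratios. First, note that

\[
SSTR=n(\bar{X}-\bar{Y}_{..})^2+m(\bar{Y}-\bar{Y}_{..})^2
\]

In this case, $\bar{Y}_{..}=\frac{1}{n+m}(n\bar{X}+m\bar{Y})$, so

\[
\begin{aligned}
SSTR &= n\left[\bar{X}-\frac{n\bar{X}+m\bar{Y}}{n+m}\right]^2+m\left[\bar{Y}-\frac{n\bar{X}+m\bar{Y}}{n+m}\right]^2 \\
&= \left[\frac{nm^2}{(n+m)^2}+\frac{mn^2}{(n+m)^2}\right](\bar{X}-\bar{Y})^2 \\
&= \frac{nm}{n+m}(\bar{X}-\bar{Y})^2
\end{aligned}
\]

Also,

\[
\begin{aligned}
SSE &= (n_1-1)S_1^2+(n_2-1)S_2^2 \\
&= (n-1)S_X^2+(m-1)S_Y^2 \\
&= (n+m-2)S_p^2
\end{aligned}
\]

Substituting these expressions for SSTR and SSE into the F statistic of Equation
12.2.6 yields

\[
F=\frac{\frac{nm}{n+m}(\bar{X}-\bar{Y})^2}{(n+m-2)S_p^2/(n+m-2)}=\frac{\frac{nm}{n+m}(\bar{X}-\bar{Y})^2}{S_p^2}=\frac{(\bar{X}-\bar{Y})^2}{S_p^2\left(\frac{1}{n}+\frac{1}{m}\right)} \tag{12.2.7}
\]

Notice that the right-hand expression in Equation 12.2.7 is the square of the
two-sample $t$ statistic described in Theorem 9.2.2. Moreover,

\[
\begin{aligned}
\alpha &= P(T\leq -t_{\alpha/2,n+m-2}\ \text{or}\ T\geq t_{\alpha/2,n+m-2})=P\left(T^2\geq t^2_{\alpha/2,n+m-2}\right) \\
&= P\left(F_{1,n+m-2}\geq t^2_{\alpha/2,n+m-2}\right)
\end{aligned}
\]

But the unique value $c$ such that $P(F_{1,n+m-2}\geq c)=\alpha$ is $c=F_{1-\alpha,1,n+m-2}$, so
$F_{1-\alpha,1,n+m-2}=t^2_{\alpha/2,n+m-2}$. Thus,

\[
\alpha=P(T\leq -t_{\alpha/2,n+m-2}\ \text{or}\ T\geq t_{\alpha/2,n+m-2})=P(F\geq F_{1-\alpha,1,n+m-2})
\]

It follows that if one test statistic rejects $H_0$ at the $\alpha$ level of significance, so will the
other.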
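The algebraic identity $F=t^2$ in Equation 12.2.7 can be checked on any pair of samples. The Python sketch below is not part of the original text; the two small samples are hypothetical, and only the observed statistics are compared, not the critical values.

```python
import math

# Hypothetical two-sample data (n = 4, m = 5).
x = [3.1, 4.0, 5.2, 4.4]
y = [6.0, 5.1, 7.3, 6.6, 5.9]

n, m = len(x), len(y)
xbar = sum(x) / n
ybar = sum(y) / m

# Pooled two-sample t statistic.
ss_x = sum((v - xbar) ** 2 for v in x)
ss_y = sum((v - ybar) ** 2 for v in y)
sp2 = (ss_x + ss_y) / (n + m - 2)                 # pooled variance S_p^2
t_stat = (xbar - ybar) / math.sqrt(sp2 * (1 / n + 1 / m))

# One-factor analysis of variance with k = 2.
grand_mean = (sum(x) + sum(y)) / (n + m)
sstr = n * (xbar - grand_mean) ** 2 + m * (ybar - grand_mean) ** 2
sse = ss_x + ss_y
f_stat = (sstr / (2 - 1)) / (sse / (n + m - 2))

print("t   =", t_stat)
print("t^2 =", t_stat ** 2)
print("F   =", f_stat)
```

The sign of the $t$ ratio is lost when it is squared, which is why the F test corresponds to the two-sided version of the $t$ test.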


Questions 


12.2.11. Verify the conclusion of Example 12.2.2 by doing
a $t$ test and an analysis of variance on the data of Question
9.2.8. Show that the observed F ratio is the square
of the observed $t$ ratio and that the F critical value is the
square of the $t$ critical value.


12.2.12. Do an analysis of variance on the Mark Twain— 
Quintus Curtius Snodgrass data of Case Study 9.2.1. 


Verify that the observed F ratio is the square of the 
observed t ratio. 


12.2.13. Do an analysis of variance and a pooled two- 
sample t test on the motorcycle data given in Ques- 
tion 8.2.2. How are the observed F ratio and observed 
t ratio related? How are the two critical values related? 
Assume that a = 0.05. 
12.3 Multiple Comparisons: Tukey’s Method


The suspicion that smoking affects heart rates was borne out by the analysis done
in Case Study 12.2.1. In retrospect, the fact that $H_0: \mu_1=\mu_2=\mu_3=\mu_4$ was rejected
is not surprising, given the sizeable range in the $\bar{Y}_{.j}$'s (from 62.3 for nonsmokers
to 81.7 for heavy smokers). But not all the treatment groups were far apart: The
heart rates for nonsmokers and light smokers were fairly close, 62.3 versus 63.2.
That raises an obvious question: Is there some way to follow up an initial test of
$H_0: \mu_1=\mu_2=\cdots=\mu_k$ by looking at subhypotheses, that is, can we test hypotheses
that involve fewer than the full set of population means (for example, $H_0: \mu_1=\mu_2$)?

The answer is “yes,” but the solution is not as simple as it might appear at first
glance. In particular, it would be inappropriate to do a series of standard two-sample
$t$ tests on different pairs of means, for example, applying Theorem 9.2.1 to $\mu_1$ versus
$\mu_2$, then to $\mu_2$ versus $\mu_3$, and so on. If each of those tests was done at a certain
level of significance $\alpha$, the probability that at least one Type I error would be committed
would be much larger than $\alpha$. That being the case, the “nominal” value for $\alpha$
misrepresents the collective precision of the inferences.

Suppose, for example, we did ten independent tests of the form $H_0: \mu_i=\mu_j$ versus
$H_1: \mu_i\neq\mu_j$, each at level $\alpha=0.05$, on a large set of population means. Even
though the probability of making a Type I error on any given test is only 0.05, the
chances of incorrectly rejecting a true $H_0$ with at least one of the ten $t$ tests increases
dramatically to 0.40:

\[
\begin{aligned}
P(\text{at least one Type I error}) &= 1-P(\text{no Type I errors}) \\
&= 1-(0.95)^{10} \\
&= 0.40
\end{aligned}
\]
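How quickly this probability grows with the number of tests is easy to tabulate. The short Python sketch below is not part of the original text; the function name is illustrative, and the calculation assumes, as the paragraph above does, that the tests are independent.

```python
def familywise_error(alpha, m):
    """P(at least one Type I error) in m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m

for m in (1, 3, 10, 20, 50):
    print(f"{m:2d} tests at alpha = 0.05 -> {familywise_error(0.05, m):.3f}")
```

With ten tests the function returns approximately 0.401, the figure rounded to 0.40 above.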


Addressing that concern, mathematical statisticians have paid a good deal of 
attention to the so-called multiple comparison problem. Many different procedures, 
operating under various sets of assumptions, have been developed. All have the 
objective of keeping the probability of committing at least one Type I error small, 
even when the number of tests performed is large (or even infinite). In this section, 
we develop one of the earliest of these techniques, a still widely used method due to 
John Tukey. 


A Background Result: The Studentized Range Distribution 


The simplest multiple comparison problem is to test the equality of all pairs of individual
means, that is, to test with one procedure $H_0: \mu_i=\mu_j$ versus $H_1: \mu_i\neq\mu_j$ for
all $i\neq j$. In Tukey's method, these tests are performed using confidence intervals for
$\mu_i-\mu_j$. The derivation depends on knowing the probabilistic behavior of the ratio
$R/S$, where $R$ is the range of a set of normally distributed random variables, and $S$
is an estimator for their true standard deviation.


Definition 12.3.1. Let $W_1, W_2, \ldots,$ and $W_k$ be a set of $k$ independent, normally
distributed random variables with mean $\mu$ and variance $\sigma^2$, and let $R$ denote
their range:

\[
R=\max_i W_i-\min_i W_i
\]

Suppose $S^2$ is based on a chi square random variable with $v$ degrees of freedom,
independent of the $W_i$'s, where $E(S^2)=\sigma^2$. The studentized range, $Q_{k,v}$, is the
ratio

\[
Q_{k,v}=\frac{R}{S}
\]

Table A.5 in the Appendix gives values of $Q_{\alpha,k,v}$, the $100(1-\alpha)$th percentile of
$Q_{k,v}$, for $\alpha=0.05$ and 0.01, and for various values of $k$ and $v$. For example, if $k=4$
and $v=8$, $Q_{.05,4,8}=4.53$, meaning that $P\left(\frac{R}{S}\geq 4.53\right)=0.05$, where $R$ is the range
of four normally distributed random variables, whose true standard deviation, $\sigma$, is
being estimated by a sample standard deviation, $S$, having 8 degrees of freedom (see
Figure 12.3.1). (Note: For the applications of the studentized range in this chapter,
$S^2$ will always be MSE and $v$ will be $n-k$.)

[Figure 12.3.1: probability density of $Q_{4,8}$; the area to the right of 4.53 equals 0.05.]

Theorem 12.3.1. Let $\bar{Y}_{.j}$, $j=1, 2, \ldots, k$, be the $k$ sample means in a completely randomized one-factor
design. Let $n_j=r$ be the common sample size, and let $\mu_j$ be the true means, $j=1, 2, \ldots, k$.
The probability is $1-\alpha$ that all $\binom{k}{2}$ differences $\mu_i-\mu_j$ will simultaneously
satisfy the inequalities

\[
\bar{Y}_{.i}-\bar{Y}_{.j}-D\sqrt{MSE}<\mu_i-\mu_j<\bar{Y}_{.i}-\bar{Y}_{.j}+D\sqrt{MSE}
\]

where $D=Q_{\alpha,k,rk-k}/\sqrt{r}$. If, for a given $i$ and $j$, zero is not contained in the preceding
inequality, $H_0: \mu_i=\mu_j$ can be rejected in favor of $H_1: \mu_i\neq\mu_j$, at the $\alpha$ level of
significance.

Proof Let $W_t=\bar{Y}_{.t}-\mu_t$. Then $W_t$ is normally distributed with mean zero and variance
$\sigma^2/r$. Let $\max W_t$ and $\min W_t$ denote the maximum and minimum values,
respectively, for $W_t$, where $t$ ranges from 1 to $k$.

Take $MSE/r$ to be the estimator for $\sigma^2/r$. From the definition of the studentized
range, $\dfrac{\max W_t-\min W_t}{\sqrt{MSE/r}}$ has a $Q_{k,rk-k}$ pdf, which implies that

\[
P\left(\frac{\max W_t-\min W_t}{\sqrt{MSE/r}}<Q_{\alpha,k,rk-k}\right)=1-\alpha
\]

or, equivalently,

\[
P\left(\max W_t-\min W_t<D\sqrt{MSE}\right)=1-\alpha \tag{12.3.1}
\]

where $D=Q_{\alpha,k,rk-k}/\sqrt{r}$.
Now, if Equation 12.3.1 is true, it must also be true that

\[
P\left(|W_i-W_j|<D\sqrt{MSE}\right)=1-\alpha\quad\text{for all } i \text{ and } j \tag{12.3.2}
\]

Rewriting Equation 12.3.2 gives

\[
P\left(-D\sqrt{MSE}<W_i-W_j<D\sqrt{MSE}\right)=1-\alpha\quad\text{for all } i \text{ and } j \tag{12.3.3}
\]

Recall that $W_t=\bar{Y}_{.t}-\mu_t$. Substituting the latter for $W_i$ and $W_j$ into Equation 12.3.3
yields the statement of the theorem:

\[
P\left(\bar{Y}_{.i}-\bar{Y}_{.j}-D\sqrt{MSE}<\mu_i-\mu_j<\bar{Y}_{.i}-\bar{Y}_{.j}+D\sqrt{MSE}\right)=1-\alpha
\]

for all $i$ and $j$.


Case Study 12.3.1 


A certain fraction of antibiotics injected into the bloodstream are “bound” to
serum proteins. This phenomenon bears directly on the effectiveness of the
medication, because the binding decreases the systemic uptake of the drug.
Table 12.3.1 lists the binding percentages in bovine serum measured for five
widely prescribed antibiotics (214). Which antibiotics have similar binding
properties, and which are different?


Table 12.3.1

                 Penicillin G   Tetracycline   Streptomycin   Erythromycin   Chloramphenicol
                 29.6           27.3           5.8            21.6           29.2
                 24.3           32.6           6.2            17.4           32.8
                 28.5           30.8           11.0           18.3           25.0
                 32.0           34.8           8.3            19.0           24.2
$T_{.j}$         114.4          125.5          31.3           76.3           111.2
$\bar{Y}_{.j}$   28.6           31.4           7.8            19.1           27.8

To answer that question requires that we make all $\binom{5}{2}=10$ pairwise comparisons
of $\mu_i$ versus $\mu_j$. First, MSE must be computed. From the entries in
Table 12.3.1,

\[
SSE=\sum_{j=1}^{5}\sum_{i=1}^{4}(Y_{ij}-\bar{Y}_{.j})^2=135.83
\]

so $MSE=135.83/(20-5)=9.06$. Let $\alpha=0.05$. Since $n-k=20-5=15$, the
appropriate cutoff from the studentized range distribution is $Q_{.05,5,15}=4.37$.
Therefore, $D=4.37/\sqrt{4}=2.185$ and $D\sqrt{MSE}=6.58$.

For each different pairwise subhypothesis test, $H_0: \mu_i=\mu_j$ versus
$H_1: \mu_i\neq\mu_j$, Table 12.3.2 lists the value of $\bar{Y}_{.i}-\bar{Y}_{.j}$, together with the corresponding
95% Tukey confidence interval for $\mu_i-\mu_j$ calculated from Theorem
12.3.1. As the last column indicates, seven of the subhypotheses are rejected
(those whose Tukey intervals do not contain zero) and three are not rejected.

Table 12.3.2

Pairwise Difference   $\bar{Y}_{.i}-\bar{Y}_{.j}$   Tukey Interval       Conclusion
$\mu_1-\mu_2$         -2.8                          (-9.38, 3.78)        NS
$\mu_1-\mu_3$         20.8                          (14.22, 27.38)       Reject
$\mu_1-\mu_4$         9.5                           (2.92, 16.08)        Reject
$\mu_1-\mu_5$         0.8                           (-5.78, 7.38)        NS
$\mu_2-\mu_3$         23.6                          (17.02, 30.18)       Reject
$\mu_2-\mu_4$         12.3                          (5.72, 18.88)        Reject
$\mu_2-\mu_5$         3.6                           (-2.98, 10.18)       NS
$\mu_3-\mu_4$         -11.3                         (-17.88, -4.72)      Reject
$\mu_3-\mu_5$         -20.0                         (-26.58, -13.42)     Reject
$\mu_4-\mu_5$         -8.7                          (-15.28, -2.12)      Reject
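The ten Tukey intervals of Case Study 12.3.1 can be generated mechanically. The following Python sketch is not part of the original text. The studentized range cutoff 4.37 is simply the value of $Q_{.05,5,15}$ quoted in the case study and is hard-coded, since the standard library has no studentized range distribution; the variable names are illustrative. Because the code uses unrounded means and an unrounded MSE, its intervals can differ from Table 12.3.2 in the second decimal place.

```python
import math
from itertools import combinations

# Binding percentages from Table 12.3.1 (r = 4 observations per antibiotic).
data = {
    "Penicillin G":    [29.6, 24.3, 28.5, 32.0],
    "Tetracycline":    [27.3, 32.6, 30.8, 34.8],
    "Streptomycin":    [5.8, 6.2, 11.0, 8.3],
    "Erythromycin":    [21.6, 17.4, 18.3, 19.0],
    "Chloramphenicol": [29.2, 32.8, 25.0, 24.2],
}

Q = 4.37                      # Q_{.05,5,15}, as quoted in the case study
k = len(data)
r = 4                         # common sample size
means = {name: sum(v) / r for name, v in data.items()}

sse = sum((y - means[name]) ** 2 for name, v in data.items() for y in v)
mse = sse / (r * k - k)
half_width = (Q / math.sqrt(r)) * math.sqrt(mse)   # D * sqrt(MSE)

rejected = []
for a, b in combinations(data, 2):                 # all 10 pairs, in table order
    diff = means[a] - means[b]
    lower, upper = diff - half_width, diff + half_width
    reject = not (lower < 0 < upper)               # zero outside the interval
    if reject:
        rejected.append((a, b))
    verdict = "Reject" if reject else "NS"
    print(f"{a} - {b}: {diff:7.2f}  ({lower:7.2f}, {upper:7.2f})  {verdict}")

print("half-width:", round(half_width, 2), " rejections:", len(rejected))
```

All ten intervals share one half-width because the sample sizes are equal, so only the observed differences change from row to row.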


Questions 


12.3.1. Use Tukey’s method to make all the pairwise com- 
parisons for the heart rate data of Case Study 12.2.1 at the 
0.05 level of significance. 


12.3.2. Construct 95% Tukey intervals for the three pairwise differences, \(\mu_i - \mu_j\), for the data of Question 12.2.3.


12.3.3. Intravenous infusion fluids produced by three dif- 
ferent pharmaceutical companies (Cutter, Abbott, and 
McGaw) were tested for their concentrations of particu- 
late contaminants. Six samples were inspected from each 
company. The figures listed in the table are, for each sam- 
ple, the number of particles per liter greater than five 
microns in diameter (183). 


Number of Contaminant Particles 


Cutter Abbott McGaw 
255 105 577 
264 288 515 
342 98 214 
331 275 413 
234 221 401 
217 240 260 


Do the analysis of variance to test \(H_0: \mu_C = \mu_A = \mu_M\) and then test each of the three pairwise subhypotheses by constructing 95% Tukey intervals.


12.3.4. Construct 95% Tukey intervals for all ten pairwise differences, \(\mu_i - \mu_j\), for the data of Question 12.2.4. Summarize the results by plotting the five sample averages on a horizontal axis and drawing straight lines under varieties whose average yields are not significantly different.


12.3.5. Construct 95% Tukey confidence intervals for 
the three pairwise differences associated with the mur- 
der culpability scores described in Question 8.2.15. Which 
differences are statistically significant? 


12.3.6. If 95% Tukey confidence intervals tell us to reject \(H_0: \mu_1 = \mu_2\) and \(H_0: \mu_1 = \mu_3\), will we necessarily reject \(H_0: \mu_2 = \mu_3\)?


12.3.7. The width of a Tukey confidence interval is

\[ 2\sqrt{MSE}\; Q_{\alpha,k,rk-k} / \sqrt{r} \]

If k increases, but r and MSE stay the same, will the Tukey intervals get shorter or longer? Justify your answer intuitively.


12.4 Testing Subhypotheses with Contrasts 


There are two general ways to test a subhypothesis, the choice depending, strangely enough, on when \(H_0\) can be fully specified. If a researcher wishes to do an experiment first, and then let the results suggest a suitable subhypothesis, the appropriate analysis is any of the various multiple comparison techniques (for example, the Tukey method of Section 12.3).

If, on the other hand, physical considerations, economic factors, past experience, or any other factors suggest a particular subhypothesis before any data are taken, \(H_0\) can best be tested using a contrast. The advantage of the latter is that tests based on contrasts have greater power than the analogous tests based on a multiple comparison procedure would have.


Definition 12.4.1. Let \(\mu_1, \mu_2, \ldots, \mu_k\) denote the true means of the k factor levels being sampled. A linear combination, \(C\), of the \(\mu_j\)'s is said to be a contrast if the sum of its coefficients is 0. That is, \(C\) is a contrast if \(C = \sum_{j=1}^{k} c_j \mu_j\), where the \(c_j\)'s are constants such that \(\sum_{j=1}^{k} c_j = 0\).


Contrasts have a direct connection with hypothesis tests. Suppose a set of data consists of five treatment levels, and we wish to test the subhypothesis \(H_0: \mu_1 = \mu_2\). The latter could also be written \(H_0: \mu_1 - \mu_2 = 0\), which is actually a statement about a contrast, specifically the contrast \(C\), where

\[ C = \mu_1 - \mu_2 = (1)\mu_1 + (-1)\mu_2 + (0)\mu_3 + (0)\mu_4 + (0)\mu_5 \]

Or, suppose in Case Study 12.3.1, there was a good pharmacological reason for comparing the average level of serum binding for the first two antibiotics to the average level for the last three. Written as a subhypothesis, the statement of no difference would be

\[ H_0: \frac{\mu_1 + \mu_2}{2} = \frac{\mu_3 + \mu_4 + \mu_5}{3} \]

As a contrast, it becomes

\[ C = \frac{1}{2}\mu_1 + \frac{1}{2}\mu_2 - \frac{1}{3}\mu_3 - \frac{1}{3}\mu_4 - \frac{1}{3}\mu_5 \]

In both these cases, the numerical value of the contrast will be 0 if \(H_0\) is true. This suggests that the choice between \(H_0\) and \(H_1\) can be accomplished by first estimating \(C\) and then determining, via a significance test, whether that estimate is too far from 0.

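We begin by considering some of the mathematical properties of contrasts and their estimates. Since \(\bar{Y}_{.j}\) is always an unbiased estimator for \(\mu_j\), it seems reasonable to estimate \(C\), a linear combination of population means, with \(\hat{C}\), a linear combination of sample means:

\[ \hat{C} = \sum_{j=1}^{k} c_j \bar{Y}_{.j} \]

(The coefficients appearing in \(\hat{C}\), of course, are the same as those that defined \(C\).) It follows that

\[ E(\hat{C}) = \sum_{j=1}^{k} c_j E(\bar{Y}_{.j}) = \sum_{j=1}^{k} c_j \mu_j = C \]

and

\[ \text{Var}(\hat{C}) = \sum_{j=1}^{k} c_j^2\, \text{Var}(\bar{Y}_{.j}) = \sigma^2 \sum_{j=1}^{k} \frac{c_j^2}{n_j} \]
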
Comment Replacing the unknown error variance, \(\sigma^2\), by its estimate from the ANOVA table, MSE, gives a formula for the estimated variance of the estimated contrast:

\[ S_{\hat{C}}^2 = MSE \sum_{j=1}^{k} \frac{c_j^2}{n_j} \]

The sampling behavior of \(\hat{C}\) is easily derived. By Theorem 4.3.3, the normality of the \(Y_{ij}\)'s ensures that \(\hat{C}\) is also normal, and by the usual Z transformation, the ratio

\[ \frac{\hat{C} - E(\hat{C})}{\sqrt{\text{Var}(\hat{C})}} = \frac{\hat{C} - C}{\sqrt{\text{Var}(\hat{C})}} \]

is a standard normal. Therefore,

\[ \left[ \frac{\hat{C} - C}{\sqrt{\text{Var}(\hat{C})}} \right]^2 \]

is a chi square random variable with 1 degree of freedom. Of course, if \(H_0: \mu_1 = \mu_2 = \ldots = \mu_k\) is true, \(C\) is 0, and the ratio reduces to

\[ \frac{\hat{C}^2}{\sigma^2 \sum_{j=1}^{k} \dfrac{c_j^2}{n_j}} \]


One additional property of contrasts is worth noting because of its connection to the treatment sum of squares in the analysis of variance. Two contrasts

\[ C_1 = \sum_{j=1}^{k} c_{1j}\mu_j \quad \text{and} \quad C_2 = \sum_{j=1}^{k} c_{2j}\mu_j \]

are said to be orthogonal if

\[ \sum_{j=1}^{k} \frac{c_{1j} c_{2j}}{n_j} = 0 \]

Similarly, a set of q contrasts, \(\{C_i\}_{i=1}^{q}\), are said to be mutually orthogonal if

\[ \sum_{j=1}^{k} \frac{c_{sj} c_{tj}}{n_j} = 0 \quad \text{for all } s \neq t \]

(The same definitions apply to estimated contrasts.)
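
The orthogonality condition is easy to check mechanically. The following is a small sketch under stated assumptions: the function name `are_orthogonal` and the sample sizes are hypothetical, chosen only to show that whether two contrasts are orthogonal depends on the \(n_j\)'s as well as on the coefficients.

```python
from fractions import Fraction


def are_orthogonal(c1, c2, sizes):
    """True if sum_j c1[j] * c2[j] / n_j equals 0 (exact rational arithmetic)."""
    total = sum(Fraction(a) * Fraction(b) / n for a, b, n in zip(c1, c2, sizes))
    return total == 0


sizes = [6, 6, 6, 5]                 # hypothetical, unequal sample sizes
first = [1, -1, 0, 0]                # mu_1 - mu_2
second = [0, 0, 1, -1]               # mu_3 - mu_4
third = [1, 1, -1, -1]               # (mu_1 + mu_2) - (mu_3 + mu_4)

print(are_orthogonal(first, second, sizes))         # no shared levels
print(are_orthogonal(second, third, sizes))         # unequal n_3 and n_4
print(are_orthogonal(second, third, [6, 6, 6, 6]))  # same pair, equal sizes
```

With equal sample sizes the condition reduces to the familiar requirement that the coefficient vectors have a zero dot product.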

Definition 12.4.2 and Theorems 12.4.1 and 12.4.2, both stated here without proof, summarize the relationship between contrasts and the analysis of variance. In short, the treatment sum of squares can be partitioned into k - 1 "contrast" sums of squares, provided the contrasts are mutually orthogonal.


Definition 12.4.2. Let \(C_i = \sum_{j=1}^{k} c_{ij}\mu_j\) be any contrast. The sum of squares associated with \(C_i\) is given by

\[ SS_{C_i} = \frac{\hat{C}_i^2}{\sum_{j=1}^{k} \dfrac{c_{ij}^2}{n_j}} \]

where \(\hat{C}_i = \sum_{j=1}^{k} c_{ij}\bar{Y}_{.j}\).


Theorem 12.4.1

Let \(\{C_i\}_{i=1}^{k-1}\) be a set of k - 1 mutually orthogonal contrasts. Then

\[ SSTR = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (\bar{Y}_{.j} - \bar{Y}_{..})^2 = SS_{C_1} + SS_{C_2} + \cdots + SS_{C_{k-1}} \]


Theorem 12.4.2

Let \(C\) be a contrast having the same coefficients as the subhypothesis \(H_0: c_1\mu_1 + c_2\mu_2 + \cdots + c_k\mu_k = 0\), where \(\sum_{j=1}^{k} c_j = 0\). Let \(n = \sum_{j=1}^{k} n_j\) be the total sample size. Then

a. \(F = \dfrac{SS_C/1}{SSE/(n-k)}\) has an F distribution with 1 and n - k degrees of freedom.

b. \(H_0: c_1\mu_1 + c_2\mu_2 + \cdots + c_k\mu_k = 0\) should be rejected at the \(\alpha\) level of significance if

\[ F \geq F_{1-\alpha,1,n-k} \]


Comment Theorem 12.4.1 is not meant to imply that only mutually orthogonal contrasts can, or should, be tested. It is simply a statement of a partitioning relationship that exists between SSTR and the sum of squares for mutually orthogonal \(C_i\)'s. In any given experiment, the contrasts that should be singled out are those the experimenter has some prior reason to test.


Case Study 12.4.1 


As a rule, infants are not able to walk by themselves until they are almost fourteen months old. One study, however, investigated the possibility of reducing that time through the use of special "walking" exercises (212). A total of twenty-three infants were included in the experiment; all were one-week-old white males. They were randomly divided into four groups, and for seven weeks each group followed a different training program. Group A received special walking and placing exercises for twelve minutes each day. Group B also had daily twelve-minute exercise periods but was not given the special walking and placing exercises. Groups C and D received no special instruction. The progress of groups A, B, and C was checked every week; the progress of group D was checked only once, at the end of the study.

After seven weeks the formal training ended and the parents were told they could continue with whatever procedure they desired. Table 12.4.1 lists the ages (in months) at which each of the twenty-three children first walked alone. Table 12.4.2 shows the analysis of variance computations. Based on 3 and 19 degrees of freedom, the \(\alpha = 0.05\) critical value is 3.13, so \(H_0: \mu_A = \mu_B = \mu_C = \mu_D\) is not rejected.


Table 12.4.1 Age When Infants First Walked Alone (Months)

                   Group A    Group B    Group C    Group D
                     9.00      11.00      11.50      13.25
                     9.50      10.00      12.00      11.50
                     9.75      10.00       9.00      12.00
                    10.00      11.75      11.50      13.50
                    13.00      10.50      13.25      11.50
                     9.50      15.00      13.00
\(T_{.j}\)          60.75      68.25      70.25      61.75
\(\bar{Y}_{.j}\)    10.12      11.38      11.71      12.35


Table 12.4.2 ANOVA Computations

Source       df      SS       MS       F
Exercises     3    14.77     4.92     2.14
Error        19    43.70     2.30
Total        22    58.47


At this point the analysis could end with the overall \(H_0\) not being rejected. We will continue with the subhypothesis procedures, however, to illustrate the application of Theorem 12.4.2.

Recall that groups A and B spent equal amounts of time exercising but followed different regimens. Consequently, a test of \(H_0: \mu_A = \mu_B\) versus \(H_1: \mu_A \neq \mu_B\) would be an obvious way to assess the effectiveness of the special walking and placing exercises. The associated contrast would be \(C_1 = \mu_A - \mu_B\). Similarly, a test of \(H_0: \mu_C = \mu_D\) (using \(C_2 = \mu_C - \mu_D\)) would provide an evaluation of the psychological effect of periodic progress checks.

From Definition 12.4.2 and the data in Table 12.4.1,

\[ SS_{C_1} = \frac{\left[ 1\left(\frac{60.75}{6}\right) - 1\left(\frac{68.25}{6}\right) \right]^2}{\frac{1^2}{6} + \frac{(-1)^2}{6}} = 4.68 \]

and

\[ SS_{C_2} = \frac{\left[ 1\left(\frac{70.25}{6}\right) - 1\left(\frac{61.75}{5}\right) \right]^2}{\frac{1^2}{6} + \frac{(-1)^2}{5}} = 1.12 \]

Dividing these sums of squares by the mean square for error (= 2.30) gives F ratios of 4.68/2.30 = 2.03 and 1.12/2.30 = 0.49, neither of which is significant at the \(\alpha = 0.05\) level (\(F_{.95,1,19} = 4.38\)) (see Table 12.4.3).


Table 12.4.3 Subhypothesis Computations

Subhypothesis              Contrast                     SS       F
\(H_0: \mu_A = \mu_B\)     \(C_1 = \mu_A - \mu_B\)     4.68     2.03
\(H_0: \mu_C = \mu_D\)     \(C_2 = \mu_C - \mu_D\)     1.12     0.49
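
The contrast computations of Case Study 12.4.1 can be reproduced directly from Definition 12.4.2 and Theorem 12.4.2. This is only an illustrative sketch: the helper names `contrast_ss` and `mean_square_error` are assumptions, and the critical value 4.38 is copied from the text rather than looked up programmatically.

```python
# Ages (months) at first unaided walking, from Table 12.4.1.
groups = {
    "A": [9.00, 9.50, 9.75, 10.00, 13.00, 9.50],
    "B": [11.00, 10.00, 10.00, 11.75, 10.50, 15.00],
    "C": [11.50, 12.00, 9.00, 11.50, 13.25, 13.00],
    "D": [13.25, 11.50, 12.00, 13.50, 11.50],
}
F_CRITICAL = 4.38  # F_{.95,1,19}, taken from the text


def contrast_ss(samples, coeffs):
    """Sum of squares for a contrast: C_hat^2 / sum(c_j^2 / n_j)."""
    assert abs(sum(coeffs.values())) < 1e-12, "coefficients must sum to 0"
    c_hat = sum(c * sum(samples[g]) / len(samples[g]) for g, c in coeffs.items())
    denom = sum(c ** 2 / len(samples[g]) for g, c in coeffs.items())
    return c_hat ** 2 / denom


def mean_square_error(samples):
    """SSE / (n - k), pooling squared deviations from each group mean."""
    n = sum(len(ys) for ys in samples.values())
    sse = 0.0
    for ys in samples.values():
        mean = sum(ys) / len(ys)
        sse += sum((y - mean) ** 2 for y in ys)
    return sse / (n - len(samples))


mse = mean_square_error(groups)
ss_c1 = contrast_ss(groups, {"A": 1, "B": -1, "C": 0, "D": 0})
ss_c2 = contrast_ss(groups, {"A": 0, "B": 0, "C": 1, "D": -1})
f1, f2 = ss_c1 / mse, ss_c2 / mse

for label, ss, f in (("C1", ss_c1, f1), ("C2", ss_c2, f2)):
    decision = "reject" if f >= F_CRITICAL else "do not reject"
    print(f"{label}: SS = {ss:.2f}, F = {f:.2f} -> {decision} H0")
```

Because nothing is rounded along the way, the printed values can differ from Table 12.4.3 in the last digit; neither F ratio comes close to the critical value.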


Questions 


12.4.1. The cathode warm-up time (in seconds) was determined for three different types of X-ray tubes using fifteen observations of each type. The results are listed in the following table.


Warm-Up Times (sec)

                Tube Type
     A             B             C
 19   27       20   24       16   14
 23   31       20   25       %6   18
 2%6  25       32   29       15   19
 18   22       27   31       18   21
 20   23       40   24       19   17
 20   27       24   25       17   19
 18   29       22   32       19   18
 35            18            18


Do an analysis of variance on these data and test the hypothesis that the three tube types require the same average warm-up time. Include a pair of orthogonal contrasts in your ANOVA table. Define one of the contrasts so it tests \(H_0: \mu_A = \mu_C\). What does the other contrast test? Check to see that the sums of squares associated with your two contrasts verify the statement of Theorem 12.4.1.


12.4.2. Test the hypothesis that the average of the true yields for the first three varieties of corn described in Question 12.2.4 is the same as the average for the last two. Let \(\alpha = 0.05\).


12.4.3. In Case Study 12.2.1 test the hypothesis that the average of the heart rates for light and moderate smokers is the same as that for heavy smokers. Let the level of significance be 0.05.


12.4.4. Large companies have the option of limiting their growth, but does doing so lead to higher profitability? The table below gives the profitability for a sample of twenty-one top-ranked companies, where profitability is expressed in terms of annual profit as a percentage of total company assets. The firms are divided into three groups by size of assets: $50 billion or less, between $51 and $100 billion, and over $100 billion. Test the hypothesis that small- and medium-size companies are as profitable as large companies. Let \(\alpha = 0.10\).


Size of Assets (billions of $)

$50 or Less     Between $51 and $100     Greater than $100
    5.7                  5.3                     9.2
    3.4                 10.4                     3.9
    ...                  ...                     ...

(Note: SSE = 147.17429)


12.4.5. Verify that \(C_3 = \frac{11}{12}\mu_A + \frac{11}{12}\mu_B - \mu_C - \frac{5}{6}\mu_D\) is orthogonal to the \(C_1\) and \(C_2\) of Case Study 12.4.1. Find \(SS_{C_3}\) and illustrate the statement of Theorem 12.4.1.


12.4.6. For many years sodium nitrite has been used as a curing agent for bacon, and until recently it was thought to be perfectly harmless. But now it appears that during frying, sodium nitrite induces the formation of nitrosopyrrolidine (NPy), a substance suspected of being a carcinogen. In one study focusing on this problem, measurements were made of the amount of NPy (in ppb) recovered after the frying of three slices of four commercially available brands of bacon (161). Do the analysis of variance for the data in the table and partition the treatment sum of squares into a complete set of three mutually orthogonal contrasts. Let the first contrast test \(H_0: \mu_A = \mu_B\), and the second, \(H_0: (\mu_A + \mu_B)/2 = (\mu_C + \mu_D)/2\). Do all tests at the 0.05 level of significance.


NPy Recovered from Bacon (ppb)

          Brand
  A     B     C     D
 20    75    15    25
 40    25    30    30
 18    21    21    31


12.5 Data Transformations 


The three assumptions required by the analysis of variance have already been mentioned: the \(Y_{ij}\)'s must be independent, normally distributed, and have the same variance for all j. In practice, these three are not equally difficult to satisfy, nor do their violations have the same consequences for the F test.

Independence is certainly a critical property for the \(Y_{ij}\)'s to have, but randomizing the order in which observations are taken (relative to the different treatment levels) tends to eliminate systematic bias, and achieve independence, quite effectively. Normality is a much more difficult property to induce or even to verify (recall Section 10.4). Fortunately, violations of that particular assumption, unless extreme, do not seriously compromise the probabilistic integrity of the analysis of variance (like the t test, the F test is robust against departures from normality).

If the final assumption is violated, though, and the \(Y_{ij}\)'s do not all have the same variance, the effect on certain inference procedures (for example, the construction of confidence intervals for individual means) can be more unsettling. However, it is possible in some situations to "stabilize" the level-to-level variances by a suitable data transformation.

Suppose that \(Y_{ij}\) has pdf \(f_Y(y_{ij}; \mu_j)\), \(i = 1, 2, \ldots, n_j\); \(j = 1, 2, \ldots, k\), and a known function \(g\) exists for which \(\text{Var}(Y_{ij}) = g(\mu_j)\). We wish to find a transformation, \(A\), that, when applied to the \(Y_{ij}\)'s, will generate a new set of variables having a constant variance; that is, \(A(Y_{ij}) = W_{ij}\), where \(\text{Var}(W_{ij}) = c_1^2\), a constant.

By Taylor's theorem,

\[ W_{ij} \approx A(\mu_j) + (Y_{ij} - \mu_j) A'(\mu_j) \]

Of course, \(E(W_{ij}) = A(\mu_j)\), since \(E(Y_{ij} - \mu_j) = 0\). Also,

\[ \text{Var}(W_{ij}) = E[W_{ij} - E(W_{ij})]^2 = E[(Y_{ij} - \mu_j) A'(\mu_j)]^2 = [A'(\mu_j)]^2\, \text{Var}(Y_{ij}) = [A'(\mu_j)]^2 g(\mu_j) \]

Solving for \(A'(\mu_j)\) gives

\[ A'(\mu_j) = \frac{\sqrt{\text{Var}(W_{ij})}}{\sqrt{g(\mu_j)}} = \frac{c_1}{\sqrt{g(\mu_j)}} \]

For \(Y_{ij}\) in the neighborhood of \(\mu_j\), it follows that

\[ A(y_{ij}) = c_1 \int \frac{1}{\sqrt{g(y_{ij})}}\, dy_{ij} + c_2 \qquad (12.5.1) \]

Example 12.5.1

Suppose the \(Y_{ij}\)'s are Poisson random variables with mean \(\mu_j\), \(j = 1, 2, \ldots, k\), so

\[ f_Y(y_{ij}; \mu_j) = \frac{e^{-\mu_j} \mu_j^{y_{ij}}}{y_{ij}!} \]

In this case, the variance is equal to the mean (recall Theorem 4.2.2):

\[ \text{Var}(Y_{ij}) = E(Y_{ij}) = \mu_j = g(\mu_j) \]

By Equation 12.5.1, then,

\[ A(y_{ij}) = c_1 \int \frac{1}{\sqrt{y_{ij}}}\, dy_{ij} + c_2 = 2 c_1 \sqrt{y_{ij}} + c_2 \]

or, letting \(c_1 = \frac{1}{2}\) and \(c_2 = 0\) to make the transformation as simple as possible,

\[ A(y_{ij}) = \sqrt{y_{ij}} \qquad (12.5.2) \]

Equation 12.5.2 implies that if the data are known in advance to be Poisson, each of the observations should be replaced by its square root before we proceed with the analysis of variance.
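
The stabilizing effect of the square root can be seen numerically. The sketch below is an illustration only: the function name `poisson_moments` and the three means are hypothetical choices. It sums the Poisson pmf directly, so no simulation or external library is involved. With \(c_1 = \frac{1}{2}\), the derivation above suggests the variance of \(\sqrt{Y}\) should be close to \(c_1^2 = \frac{1}{4}\) whatever the mean.

```python
from math import exp, sqrt


def poisson_moments(mu, transform):
    """Mean and variance of transform(Y) for Y ~ Poisson(mu), by summing the pmf."""
    upper = int(mu + 20 * sqrt(mu) + 50)   # truncation point; the tail beyond is negligible
    p = exp(-mu)                            # P(Y = 0)
    first = second = 0.0
    for y in range(upper + 1):
        w = transform(y)
        first += w * p
        second += w * w * p
        p *= mu / (y + 1)                   # P(Y = y + 1) obtained from P(Y = y)
    return first, second - first ** 2


for mu in (5, 20, 80):                      # hypothetical treatment means
    _, var_raw = poisson_moments(mu, float)
    _, var_root = poisson_moments(mu, sqrt)
    print(f"mu = {mu:3d}: Var(Y) = {var_raw:7.3f}, Var(sqrt(Y)) = {var_root:.3f}")
```

The raw variances grow in step with the mean, while the variances on the square-root scale stay near one quarter; the agreement improves as the mean increases, as expected of a first-order Taylor argument.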


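Example 12.5.2

Suppose each \(Y_{ij}\) is a binomial random variable with pdf

\[ f_Y(y_{ij}; n, p_j) = \binom{n}{y_{ij}} p_j^{y_{ij}} (1 - p_j)^{n - y_{ij}} \]

Here, \(E(Y_{ij}) = n p_j = \mu_j\), which implies that

\[ \text{Var}(Y_{ij}) = n p_j (1 - p_j) = \mu_j \left( 1 - \frac{\mu_j}{n} \right) = g(\mu_j) \]

It follows that the variance-stabilizing transformation for this type of data is the inverse sine:

\[ A(y_{ij}) = c_1 \int \frac{1}{\sqrt{y_{ij}(1 - y_{ij}/n)}}\, dy_{ij} + c_2 = c_1\, 2\sqrt{n}\, \arcsin\left( \frac{y_{ij}}{n} \right)^{1/2} + c_2 \]

or, what is equivalent,

\[ A(y_{ij}) = \arcsin\left( \frac{y_{ij}}{n} \right)^{1/2} \]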
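
A similar numerical check is possible for the inverse sine transformation. The following sketch makes the same kind of assumptions as before (the name `arcsine_variance` and the values of n and p are hypothetical). Since \(\arcsin\sqrt{Y/n}\) equals \(W/(2\sqrt{n}\,c_1)\) when \(c_2 = 0\), its variance should be roughly \(1/(4n)\) regardless of p, so the product \(4n \cdot \text{Var}\) should be close to 1.

```python
from math import asin, comb, sqrt


def arcsine_variance(n, p):
    """Variance of arcsin(sqrt(Y/n)) for Y ~ binomial(n, p), by summing the pmf."""
    first = second = 0.0
    for y in range(n + 1):
        prob = comb(n, y) * p ** y * (1 - p) ** (n - y)
        w = asin(sqrt(y / n))
        first += w * prob
        second += w * w * prob
    return second - first ** 2


n = 50                                   # hypothetical number of trials per observation
for p in (0.2, 0.5, 0.8):                # hypothetical success probabilities
    raw = n * p * (1 - p)
    scaled = 4 * n * arcsine_variance(n, p)
    print(f"p = {p}: Var(Y) = {raw:5.2f}, 4n * Var(arcsin sqrt(Y/n)) = {scaled:.3f}")
```

The untransformed variance changes with p, whereas the scaled variance of the transformed variable stays near 1 for each value tried.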


Questions 


12.5.1. A commercial film processor is experimenting 
with two kinds of fully automatic color developers. Six 
sheets of exposed film are put through each developer. 
The number of flaws on each negative visible with the 
naked eye is then counted. 


Number of Visible Flaws 


Developer A Developer B 


1 8 
4 6 
5 4 
6 9 
3 11 
7 10 


Assume the number of flaws on a given negative is a 
Poisson random variable. Make an appropriate data trans- 
formation and do the indicated analysis of variance. 


12.5.2. An experimenter wants to do an analysis of variance on a set of data involving five treatment groups, each with three replicates. She has computed \(\bar{Y}_{.j}\) and \(S_j\) for each group and gotten the results listed in the following table.


                         Treatment Group
                     1       2       3       4       5
\(\bar{Y}_{.j}\)    9.0     4.0    16.0     9.0     1.0
\(S_j\)             3.0     2.0     4.0     3.0     1.0


What should the experimenter do before computing the various sums of squares necessary to carry out the F test? Be as quantitative as possible.


12.5.3. Three air-to-surface missile launchers are tested 
for their accuracy. The same gun crew fires four rounds 
with each launcher, each round consisting of twenty mis- 
siles. A “hit” is scored if the missile lands within ten yards 
of the target. The following table gives the number of hits 
registered in each round. 


Number of Hits per Round 


Launcher A    Launcher B    Launcher C


13 15 9 
11 16 11 
10 18 10 
14 17 8 


Compare the accuracy of these three launchers by using 
the analysis of variance after making a suitable data 
transformation. Let a = 0.05. 


12.6 Taking a Second Look at Statistics (Putting the Subject of Statistics Together: The Contributions of Ronald A. Fisher)


“The time has come,” the Walrus said 
“To talk of many things: 
Of shoes—and ships—and sealing wax 
Of cabbages—and kings. 
And why the sea is boiling hot 
And whether pigs have wings.” 


Lewis Carroll 


Statistics, as we know it today, is very much a product of the twentieth century. To be sure, its roots are centuries old. The Frenchmen Blaise Pascal and Pierre Fermat did their protean work on probability in 1654. At about that same time, John Graunt was studying Bills of Mortality in England and demonstrating a remarkable flair for teasing out patterns and trends. Still, as the twentieth century dawned, there was no real subject of statistics. There were bits and pieces of probability theory, and there were more than a few extremely capable observers of random phenomena (Francis Galton and Adolphe Quetelet being among the most prominent), but there was nothing resembling any general principles or formal methodology.

Perhaps the most serious "gap" at the turn of the century was the almost total lack of information about sampling distributions. No one knew, for example, the pdfs that described quantities such as

\[ \frac{\bar{Y} - \mu}{S/\sqrt{n}}, \quad \frac{(n-1)S^2}{\sigma^2}, \quad \frac{\bar{X} - \bar{Y}}{S_p\sqrt{\frac{1}{n} + \frac{1}{m}}}, \quad \text{or} \quad \frac{S_X^2}{S_Y^2} \]

These, of course, turned up as test statistics in Chapters 6, 7, and 9. Not knowing their pdfs meant that no inferences other than point estimates could be made about the parameters of normal distributions. Moreover, there was very little known about point estimates and, more generally, about the mathematical properties that should be associated with the estimation process.

Two individuals who figured very prominently in the early efforts to put statistics on a solid mathematical footing were Karl Pearson and W. S. Gosset (who published under the pseudonym "Student"). In 1900, Pearson deduced the distribution of the goodness-of-fit statistic, which appeared in Chapter 10. And Gosset, in 1908, came up with the pdf for \(\dfrac{\bar{Y} - \mu_0}{S/\sqrt{n}}\), that is, the t distribution. It was a third person, though, Ronald A. Fisher, who stood tallest among his peers. He not only did much of the early work in deriving sampling distributions and exploring the mathematical properties of estimation, he also created the critically important area of applied statistics known as experimental design.

Born in 1890 in a suburb of London, Fisher was mathematically precocious 
and particularly adept at visualizing complicated problems in his head, a talent 
that some believe he developed to compensate for his congenitally poor eyesight. 
He graduated with distinction from Cambridge in 1912, where his specialties were 
physics and optics. During his time there, he also developed what would become 
a lifelong interest in genetics. He was particularly intrigued with the possibility of 
finding a mathematical justification for Darwin’s theory of evolution. (Almost two 
decades later, he published a book on the subject, The Genetical Theory of Natural 
Selection.) 

In 1915, he derived the distribution of the sample correlation coefficient in a paper that is often thought to mark the beginning of the modern theory of sampling distributions. After teaching high school physics for several years (a job that did not seem to suit him especially well), he accepted a position as a statistician at the Rothamsted Agricultural Station. There he absolutely flourished as he immersed himself in the pursuit of both applied and mathematical statistics. Among his accomplishments was a seminal paper published in 1921, "Mathematical Foundations of Theoretical Statistics," which provided the framework for generations of future research.

The work at Rothamsted brought him face-to-face with the very difficult 
problem of drawing inferences from field trials where biases of various sorts (dif- 
ferent soil qualities, uneven drainage gradients, etc.) were the rule rather than the 
exception. The strategies he devised for dealing with heterogeneous environments 
eventually coalesced into what is now referred to as experimental design. Guided by 
his twin principles of replication and randomization, he revolutionized the protocol 
for setting up and conducting experiments. The mathematical techniques that sup- 
ported his ideas on experimental design became known, of course, as the analysis of 
variance. In 1925, Fisher published Statistical Methods for Research Workers, a clas- 
sic text whose many subsequent editions helped countless scientists become more 
sophisticated in the ways of analyzing data. A decade later he wrote The Design of
Experiments, a second highly acclaimed guide for researchers. 

Fisher was knighted in 1952, ten years before he died in Adelaide, Australia, at 
the age of seventy-two (48). 


Appendix 12.A.1 Minitab Applications 


The Minitab command for doing the F test of Theorem 12.2.5 is

    MTB > aovoneway c1-ck

where the \(Y_{ij}\)'s from the k samples have been entered in columns c1 through ck. The output appears in the ANOVA table format of Figure 12.2.1.

Displayed in Figure 12.A.1.1 are the input and output for analyzing the heart rate data described in Case Study 12.2.1. The program also prints out 95% confidence intervals for each \(\mu_j\), that is,

\[ \left( \bar{Y}_{.j} - t_{.025,n_j-1} \cdot \frac{S}{\sqrt{n_j}},\ \ \bar{Y}_{.j} + t_{.025,n_j-1} \cdot \frac{S}{\sqrt{n_j}} \right) \]

where S is the pooled standard deviation calculated from all k samples.


Figure 12.A.1.1

    MTB > set c1
    DATA > 69 52 71 58 59 65
    DATA > end
    MTB > set c2
    DATA > 55 60 78 58 62 66
    DATA > end
    MTB > set c3
    DATA > 66 81 70 77 57 79
    DATA > end
    MTB > set c4
    DATA > 91 72 81 67 95 84
    DATA > end
    MTB > aovoneway c1-c4

    One-way ANOVA: C1, C2, C3, C4

    Source   DF       SS      MS      F       P
    Factor    3   1464.1   488.0   6.12   0.004
    Error    20   1594.8    79.7
    Total    23   3059.0

    S = 8.930   R-Sq = 47.86%   R-Sq(adj) = 40.04%

    Individual 95% CIs For Mean Based on Pooled StDev

    Level   N     Mean    StDev
    C1      6   62.333    7.267
    C2      6   63.167    8.159
    C3      6   71.667    9.158
    C4      6   81.667   10.764

    (character plot of the four intervals; axis marked at 60, 70, 80, 90)

    Pooled StDev = 8.930
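
For readers without access to Minitab, the ANOVA table of Figure 12.A.1.1 can be rebuilt from the definitions of SSTR and SSE. The sketch below is one possible way to do so, not the Minitab algorithm; the function name `one_way_anova` and the dictionary keys are assumptions, and the P-value is omitted because computing it would require an F distribution routine.

```python
# Heart rate data from Case Study 12.2.1, one list per column c1-c4.
samples = [
    [69, 52, 71, 58, 59, 65],
    [55, 60, 78, 58, 62, 66],
    [66, 81, 70, 77, 57, 79],
    [91, 72, 81, 67, 95, 84],
]


def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table as a dict."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Treatment sum of squares: group means around the grand mean.
    sstr = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Error sum of squares: observations around their own group mean.
    sse = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    mstr, mse = sstr / (k - 1), sse / (n - k)
    return {"df_tr": k - 1, "df_e": n - k, "SSTR": sstr, "SSE": sse,
            "MSTR": mstr, "MSE": mse, "F": mstr / mse}


table = one_way_anova(samples)
print(f"Factor  {table['df_tr']:3d}  {table['SSTR']:8.1f}  {table['MSTR']:7.1f}  {table['F']:.2f}")
print(f"Error   {table['df_e']:3d}  {table['SSE']:8.1f}  {table['MSE']:7.1f}")
print(f"Pooled StDev = {table['MSE'] ** 0.5:.3f}")
```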
Testing \(H_0: \mu_1 = \ldots = \mu_k\) Using Minitab Windows


1. Enter the k samples in columns C1 through Ck, respectively.
2. Click on STAT, then on ANOVA, then on ONE-WAY (UNSTACKED).
3. Type C1-Ck in RESPONSES box, and click on OK.


Pairwise comparisons are also available in Minitab, but the Tukey method requires that the data be entered differently than how they were for the AOVONEWAY command. First, the k samples are "stacked" in a single column, say, c1. Then a second column, c2, is created whose entries identify the treatment level to which each \(Y_{ij}\) in Column 1 belongs. For example, c1 and c2 for the data


    Level 1   Level 2   Level 3
       4        -1         6
       2         3         8

would be

    c1    c2
     4     1
     2     1
    -1     2
     3     2
     6     3
     8     3

The statements


    MTB > oneway c1 c2;
    SUBC > tukey.


will then produce a complete set of 95% Tukey confidence intervals.

Figure 12.A.1.2 shows the Minitab input, the ANOVA table output, and the complete set of 95% Tukey confidence intervals for the serum binding data of Case Study 12.3.1. Intervals not containing 0, of course, correspond to "pairwise" null subhypotheses that should be rejected (at the \(\alpha = 0.05\) level of significance). For example, the 95% Tukey confidence interval for \(\mu_3 - \mu_1\) extends from -27.350 to -14.200. Since 0 is not contained in that interval, the null subhypothesis \(H_0: \mu_1 = \mu_3\) should be rejected at the \(\alpha = 0.05\) level of significance.


Constructing Tukey Confidence Intervals Using Minitab Windows


1. Enter entire sample in column C1, beginning with the \(n_1\) observations in Sample 1, followed by the \(n_2\) observations in Sample 2, and so on.
2. In column C2, enter \(n_1\) 1's, followed by \(n_2\) 2's, and so on.
3. Click on STAT, then on ANOVA, then on ONE-WAY.
4. Type C1 in RESPONSE box and C2 in FACTOR box.
5. Click on COMPARISONS, then on TUKEY'S FAMILY ERROR RATE. Enter the desired value for \(100\alpha\).
6. Double click on OK.


Figure 12.A.1.2

    MTB > set c1
    DATA > 29.6 24.3 28.5 32.0 27.3 32.6 30.8 34.8 5.8 6.2
    DATA > 11.0 8.3 21.6 17.4 18.3 19.0 29.2 32.8 25.0 24.2
    DATA > end
    MTB > set c2
    DATA > 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
    DATA > end
    MTB > oneway c1 c2;
    SUBC > tukey.

    One-way ANOVA: C1 versus C2

    Source   DF        SS       MS       F       P
    C2        4   1480.82   370.21   40.88   0.000
    Error    15    135.82     9.05
    Total    19   1616.65

    S = 3.009   R-Sq = 91.60%   R-Sq(adj) = 89.36%

    Individual 95% CIs For Mean Based on Pooled StDev

    Level   N     Mean   StDev
    1       4   28.600   3.218
    2       4   31.375   3.171
    3       4    7.825   2.384
    4       4   19.075   1.806
    5       4   27.800   3.990

    (character plot of the five intervals; axis marked at 8.0, 16.0, 24.0, 32.0)

    Pooled StDev = 3.009

    Tukey 95% Simultaneous Confidence Intervals
    All Pairwise Comparisons among Levels of C2

    Individual confidence level = 99.25%

    C2 = 1 subtracted from:
    C2     Lower    Center     Upper
    2     -3.800     2.775     9.350
    3    -27.350   -20.775   -14.200
    4    -16.100    -9.525    -2.950
    5     -7.375    -0.800     5.775

    C2 = 2 subtracted from:
    C2     Lower    Center     Upper
    3    -30.125   -23.550   -16.975
    4    -18.875   -12.300    -5.725
    5    -10.150    -3.575     3.000

    C2 = 3 subtracted from:
    C2     Lower    Center     Upper
    4      4.675    11.250    17.825
    5     13.400    19.975    26.550

    C2 = 4 subtracted from:
    C2     Lower    Center     Upper
    5      2.150     8.725    15.300

    (each block is accompanied by a character plot of its intervals; axis marked at -16, 0, 16, 32)
Appendix 12.A.2 A Proof of Theorem 12.2.2


To prove that \(SSTR/\sigma^2\) has a chi square distribution with k - 1 degrees of freedom, it suffices to show that the moment-generating function of \(SSTR/\sigma^2\) is \(\left( \dfrac{1}{1 - 2t} \right)^{(k-1)/2}\).

Note, first, that under the null hypothesis that \(\mu_1 = \mu_2 = \ldots = \mu_k\),

\[ SSTOT = (n - 1)S^2 \]

where \(S^2\) is the sample variance of a set of n observations from a normal distribution. Therefore, by Theorem 7.3.2,

\[ M_{SSTOT/\sigma^2}(t) = \left( \frac{1}{1 - 2t} \right)^{(n-1)/2} \]

Also, from Theorem 12.2.3, \(SSE/\sigma^2\) is a chi square random variable with n - k degrees of freedom, so

\[ M_{SSE/\sigma^2}(t) = \left( \frac{1}{1 - 2t} \right)^{(n-k)/2} \]

Since \(SSTOT/\sigma^2\) is the sum of two independent random variables, \(SSTR/\sigma^2\) and \(SSE/\sigma^2\), its moment-generating function is the product of theirs:

\[ M_{SSTOT/\sigma^2}(t) = M_{SSTR/\sigma^2}(t) \cdot M_{SSE/\sigma^2}(t) \]

or

\[ \left( \frac{1}{1 - 2t} \right)^{(n-1)/2} = M_{SSTR/\sigma^2}(t) \left( \frac{1}{1 - 2t} \right)^{(n-k)/2} \]

which implies that

\[ M_{SSTR/\sigma^2}(t) = \left( \frac{1}{1 - 2t} \right)^{(k-1)/2} \]


Appendix 12.A.3 The Distribution of \(\dfrac{SSTR/(k-1)}{SSE/(n-k)}\) When \(H_1\) Is True


Theorem 12.2.5 gives the distribution of the test statistic

\[ F = \frac{SSTR/(k-1)}{SSE/(n-k)} \]

when the null hypothesis is true. To calculate either the power of the analysis of variance or the probability of committing a Type II error, though, requires that we know the pdf of the observed F when \(H_1\) is true.


Definition 12.A.3.1. Let \(V_j\) have a normal pdf with mean \(\mu_j\) and variance 1, for \(j = 1, \ldots, r\), and suppose that the \(V_j\)'s are independent. Then

\[ V = \sum_{j=1}^{r} V_j^2 \]

is said to have the noncentral \(\chi^2\) distribution with r degrees of freedom and noncentrality parameter \(\gamma\), where

\[ \gamma = \sum_{j=1}^{r} \mu_j^2 \]


Theorem 12.A.3.1

The moment-generating function for a noncentral \(\chi^2\) random variable, \(V\), with r degrees of freedom and noncentrality parameter \(\gamma\) is given by

\[ M_V(t) = (1 - 2t)^{-r/2} e^{\gamma t/(1 - 2t)}, \quad t < \frac{1}{2} \]

Proof We begin by finding the moment-generating function for the special case where r = 1.

Let \(V\) be a normal random variable with mean \(\mu\) and variance 1, and let \(V = Z + \mu\), where \(Z\) is a standard normal random variable. By definition, the moment-generating function for \(V^2\) can be written

\[ M_{V^2}(t) = E\left( e^{tV^2} \right) = E\left( e^{t(Z + \mu)^2} \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{t(z + \mu)^2} e^{-\frac{1}{2}z^2}\, dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tz^2 + 2tz\mu + t\mu^2 - \frac{1}{2}z^2}\, dz \]

To evaluate the integral, we first complete the square in the exponent:

\[ tz^2 + 2tz\mu + t\mu^2 - \frac{1}{2}z^2 = -\frac{1}{2}\left[ (1 - 2t)z^2 - 4t\mu z \right] + t\mu^2 \]

\[ = -\frac{1}{2} \cdot \frac{z^2 - \dfrac{4t\mu}{1 - 2t} z}{1/(1 - 2t)} + t\mu^2 \]

\[ = -\frac{1}{2} \cdot \frac{z^2 - \dfrac{4t\mu}{1 - 2t} z + \dfrac{4t^2\mu^2}{(1 - 2t)^2}}{1/(1 - 2t)} + t\mu^2 + \frac{1}{2} \cdot \frac{4t^2\mu^2/(1 - 2t)^2}{1/(1 - 2t)} \]

\[ = -\frac{1}{2} \cdot \frac{\left( z - \dfrac{2t\mu}{1 - 2t} \right)^2}{1/(1 - 2t)} + t\mu^2 + \frac{2t^2\mu^2}{1 - 2t} \]

Therefore,

\[ M_{V^2}(t) = e^{t\mu^2 + \frac{2t^2\mu^2}{1 - 2t}} \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left[ -\frac{1}{2} \cdot \frac{\left( z - \dfrac{2t\mu}{1 - 2t} \right)^2}{1/(1 - 2t)} \right] dz = (1 - 2t)^{-1/2} e^{\mu^2 t/(1 - 2t)} \]

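
The r = 1 formula can be spot-checked by numerical integration. The sketch below is only a sanity check under stated assumptions: the function names `mgf_numeric` and `mgf_closed_form`, the integration range, and the test values of t and \(\mu\) are all hypothetical choices, and the trapezoid rule stands in for the exact integral.

```python
from math import exp, pi, sqrt


def mgf_numeric(t, mu, half_range=30.0, steps=60_000):
    """Trapezoid-rule value of E[exp(t * (Z + mu)**2)], Z standard normal, t < 1/2."""
    h = 2 * half_range / steps
    total = 0.0
    for i in range(steps + 1):
        z = -half_range + i * h
        weight = 0.5 if i in (0, steps) else 1.0   # endpoints count half
        total += weight * exp(t * (z + mu) ** 2 - 0.5 * z * z)
    return total * h / sqrt(2 * pi)


def mgf_closed_form(t, mu):
    """(1 - 2t)^(-1/2) * exp(mu^2 * t / (1 - 2t)), the r = 1 case derived above."""
    return (1 - 2 * t) ** -0.5 * exp(mu * mu * t / (1 - 2 * t))


for t, mu in ((0.2, 1.5), (-0.3, 0.7), (0.1, 0.0)):
    print(t, mu, mgf_numeric(t, mu), mgf_closed_form(t, mu))
```

Setting \(\mu = 0\) recovers the moment-generating function of an ordinary (central) chi square random variable with 1 degree of freedom.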
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Theorem 
12.A.3.2 


The general result, where r 4 1, follows from an application of Theorem 


3.12.3(b). Let V= > Ve, where the V;’s are independent. Then 
j=l 


Ye x yt 
Mey (t) =(1 — 28) be = (1-21) bets 


j=l 


Definition 12.A.3.2. Let V; be a noncentral y* random variable with rj 
degrees of freedom and noncentrality parameter y. Suppose V> is a (central) 
x° random variable with r. degrees of freedom and independent of V;. The 
ratio 

V, /"1 

V2/r2 
is said to have a noncentral F distribution with r, and rz degrees of freedom and 
noncentrality parameter y. 


The ratio 


SSTR/(k — 1) 
SSE/(n —k) 


has a noncentral F distribution with k —1 and n—k degrees of freedom and 


1 
noncentrality parameter y = =i Yo nj (uj - bys 
O~ j=] 


Proof From Equation 12.2.1, 


k 
SSTR = Yon, (¥.; —p) —n(¥. —p) 


so 
1 yp 2 = 2 
as Y..- 
oe a, e (12.A.3.1) 
a oJ a//n 
Vj—p . _ 
Let Wj; = s/ =1,...,k. Since E(Y.;) = uj, ECW) = /nj (uj — w)/o. Also, 
V: yr 2/n; = 
Var(W;) = Wt at a = 1. Thus, because Y_; is normal, each W; is nor- 


o2/nj  — o2/nj 
mal with mean ./nj (4; — )/o and variance 1. The second component of SSTR/o?, 


Y.-p. ; 
ag is a standard normal random variable. 
o/ Jn 


Now, recalling the transformation technique used in Appendix 7.A.2, choose an 
orthogonal matrix A with first row (/11/n, /n2/n,...,./nx/n). Define the vector Vv 
of random variables by v= A(W), Wo,..., We). are note that 
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SSE/(n—k) 


ja V2 j=l 
i ba 1 k k 
eras eae ofA Lni¥ 2" M 
i Y.—p 


Roa Y.-p 
which gives V7 = 
a/J/n 


Because of the orthogonality of the matrix,

$$\sum_{j=1}^{k} V_j^2 = \sum_{j=1}^{k} W_j^2 \quad\text{so}\quad \sum_{j=2}^{k} V_j^2 = \sum_{j=1}^{k} W_j^2 - V_1^2$$

But $\sum_{j=1}^{k} W_j^2 - V_1^2 = SSTR/\sigma^2$ by Equation 12.A.3.1. Moreover, each $V_j$ is a normal
random variable for which $V_j = \sum_{i=1}^{k} a_{ji}W_i$, where the $a_{ji}$'s are the entries in the $j$th
row of $A$. Therefore,

$$\text{Var}(V_j) = \sum_{i=1}^{k} \text{Var}(a_{ji}W_i) = \sum_{i=1}^{k} a_{ji}^2\,\text{Var}(W_i) = \sum_{i=1}^{k} a_{ji}^2$$

since each $W_i$ has variance 1. But the orthogonality of matrix $A$ implies that
$\sum_{i=1}^{k} a_{ji}^2 = 1$ for each $j$. So each $V_j$ is normal with variance 1, and $\sum_{j=2}^{k} V_j^2$ has a
noncentral $\chi^2$ distribution with $k - 1$ degrees of freedom.


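From Question 12.A.3.4, the noncentrality parameter of $\sum_{j=2}^{k} V_j^2$ is

$$\begin{aligned}
E\left(\sum_{j=2}^{k} V_j^2\right) - (k-1) &= E\left(\sum_{j=1}^{k} W_j^2 - V_1^2\right) - (k-1) \\
&= \sum_{j=1}^{k}\left[\text{Var}(W_j) + (E(W_j))^2\right] - \left[\text{Var}(V_1) + (E(V_1))^2\right] - (k-1) \\
&= \sum_{j=1}^{k}\left[1 + \left(\sqrt{n_j}(\mu_j - \mu)/\sigma\right)^2\right] - (1 + 0) - (k-1) \\
&= \frac{1}{\sigma^2}\sum_{j=1}^{k} n_j(\mu_j - \mu)^2
\end{aligned}$$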


Therefore, since $SSE/\sigma^2$ has a $\chi^2$ distribution with $n - k$ degrees of freedom, it
follows immediately from Definition 12.A.3.2 that, when $H_1$ is true,

$$\frac{SSTR/(k-1)}{SSE/(n-k)}$$

has a noncentral $F$ distribution with $k - 1$ and $n - k$ degrees of freedom and
noncentrality parameter $\gamma = \dfrac{1}{\sigma^2}\sum_{j=1}^{k} n_j(\mu_j - \mu)^2$.


Comment As $H_1$ gets farther from $H_0$, as measured by $\gamma$, the noncentral $F$ will shift
more and more to the right of the central $F$. Accordingly, the power of the $F$ test
will increase. That is,

$$P(F \ge F_{1-\alpha,k-1,n-k}) \to 1 \quad \text{as } \gamma \to \infty$$


The pdf for the noncentral F is not very tractable, but its integral has been evaluated 
by numerical approximation. This has allowed the power function for the F test to 
be tabulated [see, for instance, (108)]. 
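
As a rough illustration of how $\gamma$ measures the distance between $H_1$ and $H_0$, the noncentrality parameter can be computed directly from its definition. The following is a minimal Python sketch, not part of the original development: the function name `noncentrality` and the sample sizes, means, and $\sigma$ are hypothetical, and $\mu$ is assumed to be the sample-size-weighted average of the $\mu_j$'s.

```python
def noncentrality(ns, mus, sigma):
    """Return gamma = (1/sigma^2) * sum_j n_j * (mu_j - mu)^2.

    Assumption: mu is the sample-size-weighted average of the mu_j's.
    """
    n = sum(ns)
    mu = sum(n_j * mu_j for n_j, mu_j in zip(ns, mus)) / n
    return sum(n_j * (mu_j - mu) ** 2 for n_j, mu_j in zip(ns, mus)) / sigma ** 2


# Hypothetical alternatives: three treatment levels, four observations each.
near = noncentrality([4, 4, 4], [10.0, 10.5, 11.0], sigma=2.0)
far = noncentrality([4, 4, 4], [10.0, 12.0, 14.0], sigma=2.0)
print(near, far)  # the alternative with the larger gamma gives the greater power
```

Comparing two alternatives in this way only ranks them by power; the actual power values would still have to come from tables of the noncentral $F$ or from numerical integration.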


Questions 


12.A.3.1. Suppose an experimenter has taken three inde- 
pendent measurements on each of five treatment levels 
and intends to use the analysis of variance to test 


$$H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 \ (= 0)$$

versus

$$H_1\colon \text{not all the } \mu_j\text{'s are equal}$$


Two of the possible alternatives in H, are 


Ay: f= 1, M2 =2, 3 =0, Wg = 1, Us 2 


and 


AY: by 3, M2 =2, w3=1, Wg =0, us =0 


Against which alternative will the F test have the greater 
power? Explain. 


12.A.3.2. In the scenario of the previous question, is 
Ay: wy = 2, to =1, w3= 1, ta 3, ws =0 an “admissible” 
alternative hypothesis? 


12.A.3.3. If the random variable $V$ has a noncentral
$\chi^2$ distribution with $r$ degrees of freedom and noncentrality
parameter $\gamma$, use its moment-generating function to
find $E(V)$.


12.A.3.4. If the random variable $V$ has a noncentral $\chi^2$
distribution with $r$ degrees of freedom and noncentrality
parameter $\gamma$, show that $\gamma = E(V) - r$.


12.A.3.5. Suppose $V_1, V_2, \ldots, V_n$ are independent noncentral
$\chi^2$ random variables having $r_1, r_2, \ldots, r_n$ degrees of
freedom, respectively, and with noncentrality parameters
$\gamma_1, \gamma_2, \ldots, \gamma_n$. Find the distribution of
$V = V_1 + V_2 + \cdots + V_n$.
Appendix 13.A.1 Minitab Applications 


“... when I first came to study statistical methods, nothing was further from my 

thoughts, or from those of my contemporaries, than that the art of experimental 

design would ever come to be, as it now surely is, an integral part of the subject.” 
—Ronald A. Fisher, 1947 


13.1 Introduction 


In any experiment, reducing the magnitude of the experimental error is a highly 
desirable objective: The smaller $\sigma^2$ is, the better will be our chances of rejecting a
false null hypothesis. Basically, there are two ways to reduce experimental error. The 
nonstatistical approach is simply to refine the experimental technique—use better 
equipment, minimize subject error, and so on. The statistical method, which can 
often produce results much more dramatic, is to collect the data in “blocks,” in what 
is referred to as a randomized block design. 

Historically, it was Fisher who first advanced the notion of blocking. He saw it 
initially as a statistical defense against the obfuscating effects of soil heterogeneity in 
agricultural experiments. Suppose, for example, a researcher wishes to compare the 
yields of four different varieties of corn. Figure 13.1.1(a) shows the simplest exper- 
imental layout: Variety A is planted in the leftmost portion of the field, variety B 
is planted next to A, and so on. Even to a city slicker, though, the statistical haz- 
ards in using this design should be obvious. Suppose, for example, there was a soil 
gradient in the field, with the best soil being in the westernmost part (where variety 
A was planted). Then if variety A achieved the highest yield, we would not know 
whether to attribute its success to its inherent quality or to its location (or to some 
combination of both). 

Figure 13.1.1 Two different experimental designs: (a) and (b).

A more sensible approach is pictured in Figure 13.1.1(b). There the field is
divided into a number of smaller “blocks,” each block being still further parceled
into four “plots.” All four varieties are planted in each block, one to a plot, with the
plot assignments being chosen at random. Notice that the geographical contiguity of
the four plots within a given block ensures that the environmental conditions from
plot to plot will be relatively uniform and will not lead to any biasing of the observed
yields. What the analysis of variance will then do is “pool” from block to block the 
within-block information concerning the treatment differences while bypassing the 
between-block differences —that is, the heterogeneity in the experimental environ- 
ment. As a result, the treatment comparisons can be made with greater precision. 
Analytically, where the total sum of squares was partitioned into two components in 
a completely randomized one-factor design, it will be split into three separate sums 
in a randomized block design: one for treatments, another for blocks, and a third for 
experimental error. 

It did not take long for scientists to realize that the benefits of blocking could 
be extended well beyond the confines of agricultural experimentation. In medical 
research, blocks are often made up of subjects of the same age, sex, and overall 
physical condition. A common practice in animal studies is to form blocks out of 
littermates. Industrial experiments often require that “time” be a blocking criterion: 
Measurements taken by personnel on the day shift might be considered one block 
and those taken by the night shift a second block. In some sense the ultimate form 
of blocking, although one not always physically possible, is to apply the entire set of 
treatment levels to each subject, thus making each subject its own block. 

Section 13.2 begins with a development of the analysis of variance for the ran- 
domized block design, where k treatment levels are administered within each of b 
blocks. The observations within a given block, of course, are dependent. As was the 
case in Chapter 12, the hypotheses to be tested are $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$ versus
$H_1$: not all the $\mu_j$'s are equal. The section concludes with a pair of case studies that
illustrate the blocking concept in two very different settings. 

We saw in the previous chapter that when k = 2, and the samples are indepen- 
dent, the F test is equivalent to a two-sample t test. A similar duality exists here. 
When $k = 2$ treatment levels are compared within $b$ blocks, $H_0$: $\mu_1 = \mu_2$ can be tested
using either the analysis of variance or a paired t test. The latter is described in 
Section 13.3. 
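
The duality can be checked numerically. The sketch below is an illustration only: it uses hypothetical measurements, assumes the usual form of the paired $t$ statistic (the mean within-block difference divided by its estimated standard error), and uses the randomized block sums of squares in their deviation form. The function names are not from the text.

```python
import math


def paired_t(x, y):
    """Paired t statistic based on the within-block differences."""
    d = [xi - yi for xi, yi in zip(x, y)]
    b = len(d)
    d_bar = sum(d) / b
    s2 = sum((di - d_bar) ** 2 for di in d) / (b - 1)
    return d_bar / math.sqrt(s2 / b)


def block_f(x, y):
    """Treatment F ratio for a randomized block design with k = 2."""
    b, k = len(x), 2
    grand = (sum(x) + sum(y)) / (b * k)
    col_means = [sum(x) / b, sum(y) / b]
    row_means = [(xi + yi) / k for xi, yi in zip(x, y)]
    sstr = b * sum((m - grand) ** 2 for m in col_means)
    sse = sum(
        (obs - col_means[j] - row_means[i] + grand) ** 2
        for i, pair in enumerate(zip(x, y))
        for j, obs in enumerate(pair)
    )
    return (sstr / (k - 1)) / (sse / ((b - 1) * (k - 1)))


# Hypothetical measurements: two treatments applied within each of five blocks.
x = [12.1, 9.8, 14.3, 11.0, 10.4]
y = [10.9, 9.1, 12.2, 10.8, 9.0]
t = paired_t(x, y)
F = block_f(x, y)
print(t ** 2, F)  # the two values agree up to rounding error
```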


13.2 The F Test for a Randomized Block Design 


Superficially, the structure of randomized block data looks much like the format we
encountered in Chapter 12: associated with each of $k$ treatment levels is a sample
of measurements. Here, though, each column has exactly the same number of
observations (that is, $n_j = b$ for all $j$), so the data set is necessarily a $b \times k$ matrix
(see Table 13.2.1).

On the other hand, from a statistical standpoint randomized block data are fun- 
damentally different from k-sample data (recall the discussion in Section 8.2). In 
Chapter 12, the k samples were independent. Here, the observations within a given 
row (which corresponds to a block) are dependent, since each reflects to some extent 
the conditions inherent in that block. That distinction causes the analysis of variance 
to proceed differently. 
Table 13.2.1

                       Treatment Level          Block     Block    True Block
                  1       2      ...     k      Totals    Means    Effects
           1     Y_11    Y_12    ...    Y_1k    T_1.      Ȳ_1.     β_1
Blocks     2     Y_21    Y_22    ...    Y_2k    T_2.      Ȳ_2.     β_2
          ...
           b     Y_b1    Y_b2    ...    Y_bk    T_b.      Ȳ_b.     β_b
Sample totals    T_.1    T_.2    ...    T_.k    T_..
Sample means     Ȳ_.1    Ȳ_.2    ...    Ȳ_.k              Ȳ_..
True means       μ_1     μ_2     ...    μ_k

Our objective is to test $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$, the same as it was in Chapter 12.


But here the mathematical model associated with $Y_{ij}$ has an additional term, representing
the effect of the $i$th block. If each “block effect,” $\beta_i$, is assumed to be
additive, we can write

$$Y_{ij} = \mu_j + \beta_i + \varepsilon_{ij}$$

where $\varepsilon_{ij}$ is normally distributed with mean zero and variance $\sigma^2$, for $i = 1, 2, \ldots, b$
and $j = 1, 2, \ldots, k$. As before, we will let $\mu$ denote the overall average treatment
effect associated with the $bk$ observations; that is, $\mu = \frac{1}{k}\sum_{j=1}^{k} \mu_j$.


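The basic approach followed in Chapter 12 can still be taken here, but SSE
needs to be recalculated, because the “error” in a set of randomized block measurements
will reflect both the block effect and the random error. To separate the two
requires that we first estimate the set of block effects, $\beta_1, \beta_2, \ldots,$ and $\beta_b$.

Let $\bar{Y}_{i\cdot} = \frac{1}{k}\sum_{j=1}^{k} Y_{ij}$ denote the sample average of the $k$ observations in the $i$th
block. Suppose the data contained no random error; that is, $\varepsilon_{ij} = 0$ for all $i$ and $j$.
Then

$$\bar{Y}_{i\cdot} = \frac{1}{k}\sum_{j=1}^{k}(\mu_j + \beta_i) = \frac{1}{k}\sum_{j=1}^{k}\mu_j + \beta_i = \mu + \beta_i$$

so that $\beta_i = \bar{Y}_{i\cdot} - \mu$. If $\bar{Y}_{\cdot\cdot}$ is substituted for $\mu$, the estimate for $\beta_i$ becomes $\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}$.

Now, adding and subtracting $\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}$ in the expression for SSE from Chapter 12
gives

$$\sum_{i=1}^{b}\sum_{j=1}^{k}(Y_{ij} - \bar{Y}_{\cdot j})^2 = \sum_{i=1}^{b}\sum_{j=1}^{k}\left[(Y_{ij} - \bar{Y}_{\cdot j}) + (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}) - (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})\right]^2 = \sum_{i=1}^{b}\sum_{j=1}^{k}\left[(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}) + (Y_{ij} - \bar{Y}_{\cdot j} - \bar{Y}_{i\cdot} + \bar{Y}_{\cdot\cdot})\right]^2$$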
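Notice that the cross-product term can be written

$$2\sum_{i=1}^{b}(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})\sum_{j=1}^{k}(Y_{ij} - \bar{Y}_{\cdot j} - \bar{Y}_{i\cdot} + \bar{Y}_{\cdot\cdot})$$

But

$$\sum_{j=1}^{k}(Y_{ij} - \bar{Y}_{\cdot j} - \bar{Y}_{i\cdot} + \bar{Y}_{\cdot\cdot}) = k\bar{Y}_{i\cdot} - k\bar{Y}_{i\cdot} - \sum_{j=1}^{k}(\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) = 0$$

so

$$\sum_{i=1}^{b}\sum_{j=1}^{k}(Y_{ij} - \bar{Y}_{\cdot j})^2 = \sum_{i=1}^{b}\sum_{j=1}^{k}(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 + \sum_{i=1}^{b}\sum_{j=1}^{k}(Y_{ij} - \bar{Y}_{\cdot j} - \bar{Y}_{i\cdot} + \bar{Y}_{\cdot\cdot})^2 \qquad (13.2.1)$$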


Equation 13.2.1 is a key result. It shows that the “old” sum of squares for error
from Chapter 12, $\sum_{i=1}^{b}\sum_{j=1}^{k}(Y_{ij} - \bar{Y}_{\cdot j})^2$, can be partitioned into the sum of two other
sums of squares. The first, $\sum_{i=1}^{b}\sum_{j=1}^{k}(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$, is called the block sum of squares and
denoted SSB. The second is the “new” sum of squares measuring random error.
That is, for randomized block data,

$$SSE = \sum_{i=1}^{b}\sum_{j=1}^{k}(Y_{ij} - \bar{Y}_{\cdot j} - \bar{Y}_{i\cdot} + \bar{Y}_{\cdot\cdot})^2$$


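The other sums of squares from Chapter 12 remain the same in the context of
the randomized block design. Specifically,

$$SSTOT = \text{total sum of squares} = \sum_{i=1}^{b}\sum_{j=1}^{k}(Y_{ij} - \bar{Y}_{\cdot\cdot})^2$$

and

$$SSTR = \text{treatment sum of squares} = \sum_{i=1}^{b}\sum_{j=1}^{k}(\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2$$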


Theorem 13.2.1. Suppose that $k$ treatment levels are measured over a set of $b$ blocks. Then


a. SSTOT = SSTR + SSB + SSE. 
b. SSTR, SSB, and SSE are independent random variables. 


Proof The independence of the three terms that combine to give SSTOT can be 
established using the same approach that was taken in Chapter 12. The details will 
be omitted. 
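
Part (a) of the theorem can be checked numerically from the deviation forms of the four sums of squares. The following is a minimal sketch on a small hypothetical data table (rows are blocks, columns are treatment levels); the function name `block_sums_of_squares` is illustrative and not from the text.

```python
def block_sums_of_squares(data):
    """Definitional SSTR, SSB, SSE, SSTOT for a b x k table (rows are blocks)."""
    b, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (b * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / b for j in range(k)]
    sstot = sum((y - grand) ** 2 for row in data for y in row)
    sstr = b * sum((m - grand) ** 2 for m in col_means)
    ssb = k * sum((m - grand) ** 2 for m in row_means)
    sse = sum(
        (data[i][j] - col_means[j] - row_means[i] + grand) ** 2
        for i in range(b)
        for j in range(k)
    )
    return sstr, ssb, sse, sstot


# Hypothetical 3 x 3 table: rows are blocks, columns are treatment levels.
table = [[5.0, 7.0, 6.0], [8.0, 11.0, 9.0], [4.0, 5.0, 8.0]]
sstr, ssb, sse, sstot = block_sums_of_squares(table)
print(sstr + ssb + sse, sstot)  # equal up to rounding error
```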


Theorem 13.2.2. Suppose that $k$ treatment levels, with means $\mu_1, \mu_2, \ldots, \mu_k$, are measured over a set of
$b$ blocks, where the block effects are $\beta_1, \beta_2, \ldots, \beta_b$. Then

a. When $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$ is true, $SSTR/\sigma^2$ has a chi square distribution with
$k - 1$ degrees of freedom.

b. When $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_b$ is true, $SSB/\sigma^2$ has a chi square distribution with $b - 1$
degrees of freedom.

c. Regardless of whether the $\mu_j$'s and/or the $\beta_i$'s are equal, $SSE/\sigma^2$ has a chi square
distribution with $(b - 1)(k - 1)$ degrees of freedom.


Proof The proofs are similar to those for Theorems 12.2.2 and 12.2.3. 


Theorem 13.2.3. Suppose that $k$ treatment levels with means $\mu_1, \mu_2, \ldots, \mu_k$ are measured over a set of
$b$ blocks. Then

a. If $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$ is true,

$$F = \frac{SSTR/(k-1)}{SSE/(b-1)(k-1)}$$

has an $F$ distribution with $k - 1$ and $(b - 1)(k - 1)$ degrees of freedom.
b. At the $\alpha$ level of significance, $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$ should be rejected if
$F \ge F_{1-\alpha,\,k-1,\,(b-1)(k-1)}$.


Theorem 13.2.4. Suppose that $k$ treatment levels are measured over a set of $b$ blocks, where the block
effects are $\beta_1, \beta_2, \ldots,$ and $\beta_b$. Then

a. If $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_b$ is true,

$$F = \frac{SSB/(b-1)}{SSE/(b-1)(k-1)}$$

has an $F$ distribution with $b - 1$ and $(b - 1)(k - 1)$ degrees of freedom.
b. At the $\alpha$ level of significance, $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_b$ should be rejected if
$F \ge F_{1-\alpha,\,b-1,\,(b-1)(k-1)}$.


Table 13.2.2 shows the ANOVA table entries for a randomized block analysis.
Notice that two F ratios are calculated, one for the treatment effect and one for the
block effect.

Table 13.2.2

Source       df               SS      MS                   F                                      P
Treatments   k - 1            SSTR    SSTR/(k - 1)         [SSTR/(k - 1)] / [SSE/(b - 1)(k - 1)]   P[F_{k-1,(b-1)(k-1)} ≥ obs. F]
Blocks       b - 1            SSB     SSB/(b - 1)          [SSB/(b - 1)] / [SSE/(b - 1)(k - 1)]    P[F_{b-1,(b-1)(k-1)} ≥ obs. F]
Error        (b - 1)(k - 1)   SSE     SSE/(b - 1)(k - 1)
Total        n - 1            SSTOT
Computing Formulas


Let $C = T_{\cdot\cdot}^2/bk$. Then

$$SSTR = \sum_{j=1}^{k}\frac{T_{\cdot j}^2}{b} - C \qquad (13.2.2)$$

$$SSB = \sum_{i=1}^{b}\frac{T_{i\cdot}^2}{k} - C \qquad (13.2.3)$$

$$SSTOT = \sum_{i=1}^{b}\sum_{j=1}^{k} Y_{ij}^2 - C \qquad (13.2.4)$$

and, by Theorem 13.2.1,

$$SSE = SSTOT - SSTR - SSB$$

Equations 13.2.2, 13.2.3, and 13.2.4 are considerably easier to evaluate than their
counterparts on p. 632. The proofs will be left as exercises.


Case Study 13.2.1 


Acrophobia is a fear of heights. It can be treated in a number of different 
ways. Using contact desensitization, a therapist demonstrates some task that 
would be difficult for someone with acrophobia to do, such as looking over 
a ledge or standing on a ladder. Then he guides the subject through the very 
same maneuver, always keeping in physical contact. Another method of treat- 
ment is demonstration participation. Here the therapist tries to talk the subject 
through the task; no physical contact is made. A third technique, live modeling, 
requires the subject simply to watch the task being done—he does not attempt it 
himself. 

These three techniques were compared in a study involving fifteen volun- 
teers, all of whom had a history of severe acrophobia (144). It was realized at 
the outset, though, that the affliction was much more incapacitating in some sub- 
jects than in others, and that this heterogeneity might compromise the therapy 
comparison. Accordingly, the experiment began with each subject being given 
the Height Avoidance Test (HAT), a series of forty-four tasks related to ladder 
climbing. A subject received a “point” for each task successfully completed. On 
the basis of their final scores, the fifteen volunteers were divided into five blocks 
(A, B, C, D, and E), each of size 3. The subjects in Block A had the lowest scores 
(that is, the most severe acrophobia), those in Block B the second lowest scores, 
and so on. 

Each of the three therapies was then assigned at random to one of the three 
subjects in each block. When the counseling sessions were over, the subjects 
retook the HAT. Table 13.2.3 lists the changes in their scores (score after therapy
minus score before therapy). Test the hypothesis that the therapies are equally
effective. Let $\alpha = 0.01$.




Table 13.2.3 HAT Score Changes

                          Therapy
          Contact           Demonstration    Live
Block     Desensitization   Participation    Modeling    T_i.
A                8                 2             -2         8
B               11                 1              0        12
C                9                12              6        27
D               16                11              2        29
E               24                19             11        54
T_.j            68                45             17       130


Since $C = (130)^2/15 = 1126.7$ and $\sum_{i=1}^{5}\sum_{j=1}^{3} Y_{ij}^2 = 1894$, it follows that

$$SSTOT = 1894 - 1126.7 = 767.3$$

$$SSB = \frac{(8)^2}{3} + \cdots + \frac{(54)^2}{3} - 1126.7 = 438.0$$

$$SSTR = \frac{(68)^2}{5} + \frac{(45)^2}{5} + \frac{(17)^2}{5} - 1126.7 = 260.9$$

giving an error sum of squares of:

$$SSE = 767.3 - 438.0 - 260.9 = 68.4$$

The analysis of variance is summarized in Table 13.2.4. Since the calculated
value of the $F$ statistic, 15.260, exceeds $F_{.99,2,8} = 8.65$, $H_0$: $\mu_1 = \mu_2 = \mu_3$ can be
rejected at the 0.01 level. In fact, the $P$-value of 0.0019 indicates that $H_0$ can be
rejected for $\alpha$ as small as 0.0019.
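
The arithmetic above can be reproduced with a few lines of code. The sketch below applies the computing formulas (Equations 13.2.2 through 13.2.4) to the HAT score changes of Table 13.2.3. The function name `block_anova` and the dictionary layout are illustrative choices, not part of the text, and the $P$-values are not computed because doing so would require the $F$ distribution function.

```python
def block_anova(data):
    """Randomized block ANOVA via the computing formulas (rows are blocks)."""
    b, k = len(data), len(data[0])
    row_totals = [sum(row) for row in data]                       # T_i.
    col_totals = [sum(row[j] for row in data) for j in range(k)]  # T_.j
    c = sum(row_totals) ** 2 / (b * k)                            # C = T..^2 / bk
    sstr = sum(t ** 2 for t in col_totals) / b - c
    ssb = sum(t ** 2 for t in row_totals) / k - c
    sstot = sum(y ** 2 for row in data for y in row) - c
    sse = sstot - sstr - ssb
    mse = sse / ((b - 1) * (k - 1))
    return {
        "SSTR": sstr, "SSB": ssb, "SSE": sse, "SSTOT": sstot,
        "F_treatments": (sstr / (k - 1)) / mse,
        "F_blocks": (ssb / (b - 1)) / mse,
    }


# HAT score changes from Table 13.2.3 (rows: blocks A-E; columns: therapies).
hat = [[8, 2, -2], [11, 1, 0], [9, 12, 6], [16, 11, 2], [24, 19, 11]]
for name, value in block_anova(hat).items():
    print(f"{name}: {value:.2f}")
```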


Table 13.2.4

Source      df   SS       MS       F        P
Therapies    2   260.93   130.47   15.260   0.0019
Blocks       4   438.00   109.50   12.807   0.0015
Error        8    68.40     8.55
Total       14   767.33

The small $P$-value for “Blocks” (= 0.0015) implies that $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_5$
would also be rejected. Of course, that should come as no surprise: The
blocks were intentionally set up to be as different as possible. Indeed, if

$$F = \frac{SSB/(b-1)}{SSE/(b-1)(k-1)}$$

had not been large, we would have questioned the
validity of using HAT scores to measure the severity of acrophobia.


Comment Using a randomized block design instead of a one-way design is a trade-off.
The blocks result in SSE being reduced (recall Equation 13.2.1), and that
increases the probability of rejecting $H_0$ when $H_0$ is false, provided everything else
associated with the test is kept the same. But everything else is not kept the same:
The degrees of freedom associated with “error” in the randomized block analysis
[$= (b - 1)(k - 1)$] are fewer than the degrees of freedom associated with “error”
in the one-way analysis [$= k(b - 1)$]. That difference is an advantage of the one-way
analysis because the power of any hypothesis test diminishes as the number of
degrees of freedom associated with its test statistic decreases.

Ultimately, which design is preferable in a given situation depends on the mag- 
nitude of SSB. If SSB were “large,” the advantage of a much smaller SSE would 
more than compensate for the reduction in the degrees of freedom for “error,” and 
the randomized block design would be a better choice than the one-way design. On 
the other hand, if the block effects were essentially all the same (in which case SSB 
would be small), then SSE for the randomized block design would not be much 
smaller than the SSE for the one-way design. In that case, the degrees of freedom 
for “error” becomes the key issue, and the one-way design would be considered
preferable.

If an experiment has already been done as a randomized block design, an $F$
test of $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_b$ provides some guidance as to how the treatments might
best be compared if a follow-up study were to be done. If the $F$ test of $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_b$
rejects $H_0$, the decision to use the randomized block design was the right
one (especially if the $P$-value for Blocks is very small). If $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_b$ is
not rejected, future experiments comparing those same treatments should probably
either (1) utilize a different blocking criterion or (2) be set up as a one-way design.
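
The trade-off described in this Comment can be made concrete with the sums of squares from Table 13.2.4. The sketch below is purely illustrative: it asks what the treatment $F$ ratio would look like if the block variation were left in the error term, so that the error sum of squares is SSB + SSE (by Equation 13.2.1) with $k(b - 1)$ degrees of freedom. The function name is hypothetical, and the second ratio is not a valid analysis of the blocked data, only a picture of what blocking bought.

```python
def treatment_f_ratios(sstr, ssb, sse, b, k):
    """Return (F for the randomized block analysis, F with blocks ignored)."""
    f_block = (sstr / (k - 1)) / (sse / ((b - 1) * (k - 1)))
    # With blocks ignored, block variation is absorbed into "error".
    f_one_way = (sstr / (k - 1)) / ((ssb + sse) / (k * (b - 1)))
    return f_block, f_one_way


# Sums of squares from Table 13.2.4 (b = 5 blocks, k = 3 therapies).
f_block, f_one_way = treatment_f_ratios(260.93, 438.00, 68.40, b=5, k=3)
print(round(f_block, 2), round(f_one_way, 2))
```

Here the loss of four error degrees of freedom (eight instead of twelve) is far outweighed by the reduction in the error sum of squares, which is the situation the Comment describes for a “large” SSB.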


Case Study 13.2.2 


Rat poison is normally made by mixing its active chemical ingredients with ordi- 
nary cornmeal. In many urban areas, though, rats can find food that they prefer 
to cornmeal, so the poison is left untouched. One solution is to make the corn- 
meal more palatable by adding food supplements such as peanut butter or meat. 
Doing that is effective, but the cost is high and the supplements spoil quickly. 

In Milwaukee, a study was carried out to see whether artificial food sup- 
plements might be a workable compromise (85). For five two-week periods, 
thirty-two hundred baits were placed around garbage-storage areas — eight hun- 
dred consisted of plain cornmeal; a second eight hundred had cornmeal mixed 
with artificial butter-vanilla flavoring; a third eight hundred contained cornmeal 
mixed with artificial roast beef flavoring; and the remaining eight hundred were 
cornmeal mixed with artificial bread flavoring. 

Table 13.2.5 lists, for each survey, the percentage of each type of bait that
was eaten. Do the rats show any preferences for the different flavors? Were the
blocks (in this case, the surveys) helpful in reducing the error sum of squares?
If a follow-up study were to be done comparing these same baits, should a
completely randomized design or a randomized block design be used?


Table 13.2.5

Survey Number   Plain   Butter-Vanilla   Roast Beef   Bread
1               13.8    11.7             14.0         12.6
2               12.9    16.7             15.5         13.8
3               25.9    29.8             27.8         25.0
4               18.0    23.1             23.0         16.9
5               15.2    20.2             19.0         13.7

All of these questions are answered by the F ratios shown in Table 13.2.6. 
The $P$-values for $H_0$: $\mu_1 = \mu_2 = \mu_3 = \mu_4$ (= 0.0042) and $H_0$: $\beta_1 = \beta_2 = \beta_3 =
\beta_4 = \beta_5$ (= 0.0000) are both extremely small, so both null hypotheses would be
rejected. Moreover, the fact that SSB is so very large indicates that consider- 
able variation exists from survey to survey, irrespective of the baits. It follows 
that any future studies should be set up in a similar fashion—that is, using a 
randomized block design. 


Table 13.2.6

Source    df   SS       MS       F       P
Flavors    3    56.38    18.79    7.58   0.0042
Surveys    4   495.32   123.83   49.93   0.0000
Error     12    29.76     2.48
Total     19   581.46


Tukey Comparisons for Randomized Block Data 


The Tukey pairwise comparison technique of Section 12.3 can also be applied
to a randomized block design. The definition of $D$ is slightly different, since
the associated studentized range is no longer $Q_{k,rk-k}$ but rather $Q_{k,(b-1)(k-1)}$,
a change reflecting the number of degrees of freedom available for MSE in
estimating $\sigma^2$.


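Theorem 13.2.5. Let $\bar{Y}_{\cdot j}$, $j = 1, 2, \ldots, k$, be the sample means in a $b \times k$ randomized block design. Let
$\mu_j$ be the true treatment means, $j = 1, 2, \ldots, k$. The probability is $1 - \alpha$ that all $\binom{k}{2}$
pairwise subhypotheses $H_0$: $\mu_s = \mu_t$ will simultaneously satisfy the inequalities

$$\bar{Y}_{\cdot s} - \bar{Y}_{\cdot t} - D\sqrt{MSE} < \mu_s - \mu_t < \bar{Y}_{\cdot s} - \bar{Y}_{\cdot t} + D\sqrt{MSE}$$

where $D = Q_{\alpha,k,(b-1)(k-1)}/\sqrt{b}$. If, for a given $s$ and $t$, zero is not contained in the preceding
inequality, $H_0$: $\mu_s = \mu_t$ can be rejected in favor of $H_1$: $\mu_s \neq \mu_t$ at the $\alpha$ level of
significance.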


Example 13.2.1. Recall the comparison of the three acrophobia therapies in Case Study 13.2.1. The
$F$ test in Table 13.2.4 showed that $H_0$: $\mu_1 = \mu_2 = \mu_3$ can be rejected at the $\alpha = 0.05$ (or
even 0.005) level of significance. Should all three therapies, though, be considered
different, or is one of them simply different from the other two?

That question can be answered by constructing the set of 95% Tukey confidence
intervals for the three pairwise comparisons. Here,

$$D = \frac{Q_{.05,3,8}}{\sqrt{5}} = \frac{4.04}{2.24} = 1.81$$

and the radius of the Tukey intervals is

$$D\sqrt{MSE} = 1.81\sqrt{8.55} = 5.3$$


Table 13.2.7 summarizes the calculations called for in Theorem 13.2.5. 

Now we have a much better picture of the relative values of these three
therapies. Based on the Tukey intervals, the difference in the means for contact
desensitization ($\mu_1$) and demonstration participation ($\mu_2$) is not statistically significant.
However, the increases in both the contact desensitization mean ($\mu_1$) and the
demonstration participation mean ($\mu_2$) relative to the live modeling mean ($\mu_3$) are
statistically significant.


Table 13.2.7

Pairwise Difference   Ȳ_.s - Ȳ_.t   Tukey Interval   Conclusion
μ_1 - μ_2                  4.6       (-0.7, 9.9)      Not significant
μ_1 - μ_3                 10.2       (4.9, 15.5)      Reject
μ_2 - μ_3                  5.6       (0.3, 10.9)      Reject
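
The interval calculations called for in Theorem 13.2.5 can be sketched in a few lines of code. In the example below, the studentized range cutoff $Q_{.05,3,8} = 4.04$ is taken from the text and must be supplied from a table; the function name `tukey_block_intervals` and its argument names are illustrative assumptions rather than anything defined in the book.

```python
import math
from itertools import combinations


def tukey_block_intervals(means, mse, b, q):
    """Tukey intervals for all pairwise differences of treatment means.

    q is the studentized range cutoff Q_{alpha, k, (b-1)(k-1)}, read from a table.
    """
    radius = (q / math.sqrt(b)) * math.sqrt(mse)  # D * sqrt(MSE)
    intervals = {}
    for s, t in combinations(range(len(means)), 2):
        diff = means[s] - means[t]
        intervals[(s + 1, t + 1)] = (diff - radius, diff + radius)
    return intervals


# Therapy means from Table 13.2.3 (T_.j / 5), MSE from Table 13.2.4, Q = 4.04.
means = [68 / 5, 45 / 5, 17 / 5]
for pair, (lo, hi) in tukey_block_intervals(means, 8.55, b=5, q=4.04).items():
    verdict = "reject" if lo > 0 or hi < 0 else "not significant"
    print(pair, round(lo, 1), round(hi, 1), verdict)
```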


Contrasts for Randomized Block Data 


The techniques we learned in Section 12.4 for testing contrasts can be readily
adapted to randomized block designs. If $C$ is the contrast associated with the null
hypothesis, the appropriate test statistic is

$$F = \frac{SS_C/1}{SSE/(b-1)(k-1)}$$

where $F$ has 1 and $(b - 1)(k - 1)$ degrees of freedom and SSE is the error sum of
squares defined for randomized block data.


Case Study 13.2.3 


In folklore, the full moon is often portrayed as something sinister, a kind of evil 
force possessing the power to control our behavior. Over the centuries, many 
prominent writers and philosophers have shared this belief (126). Milton, in 
Paradise Lost, refers to 


Demoniac frenzy, moping melancholy 
And moon-struck madness. 


And Othello, after the murder of Desdemona, laments: 


It is the very error of the moon, 
She comes more near the earth than she was wont 
And makes men mad. 


On a more scholarly level, Sir William Blackstone, the renowned eighteenth-century
English barrister, defined a “lunatic” as


one who hath... lost the use of his reason and who hath lucid intervals, sometimes
enjoying his senses and sometimes not, and that frequently depending
upon changes of the moon.


The possibility of lunar phases influencing human affairs is a theory not 
without supporters among the scientific community. Studies by reputable medi- 
cal researchers have attempted to link the “Transylvania effect,” as it has come 
to be known, with higher suicide rates, pyromania, and even epilepsy. 

The relationship between lunar cycles and mental breakdowns has also 
been studied. Table 13.2.8 shows the admission rates to the emergency room 
of a Virginia mental health clinic before, during, and after the twelve full moons 
from August 1971 to July 1972 (11). Here, “time,” as expressed in months, is 
acting as the blocking variable. 


Table 13.2.8 Admission Rates (patients/day)

          (1)            (2)            (3)
          Before Full    During Full    After Full
Month     Moon           Moon           Moon           Ȳ_i.
Aug.       6.4            5.0            5.8            5.73
Sept.      7.1           13.0            9.2            9.77
Oct.       6.5           14.0            7.9            9.47
Nov.       8.6           12.0            7.7            9.43
Dec.       8.1            6.0           11.0            8.37
Jan.      10.4            9.0           12.9           10.77
Feb.      11.5           13.0           13.5           12.67
Mar.      13.8           16.0           13.1           14.30
Apr.      15.4           25.0           15.8           18.73
May       15.7           13.0           13.3           14.00
June      11.7           14.0           12.8           12.83
July      15.8           20.0           14.5           16.77
Ȳ_.j      10.92          13.33          11.46


Table 13.2.9 summarizes the ANOVA calculations. For 2 and 22 degrees of 
freedom, the 0.05 critical value for the lunar cycle effect is 3.44, which is greater 
than the observed $F$ (= 3.22). Therefore, we would fail to reject $H_0$: $\mu_1 = \mu_2 = \mu_3$,
and the conclusion would be that a lunar effect has not been demonstrated. 


Table 13.2.9

Source         df   SS       MS      F
Lunar cycles    2    38.59   19.30   3.22
Months         11   451.08   41.01
Error          22   132.08    6.00
Total          35   621.75




Testing the overall $H_0$: $\mu_1 = \mu_2 = \mu_3$ is not the only appropriate way to analyze
these data, though. An a priori subhypothesis is clearly suggested by the
circumstances of the problem: specifically, it would make sense to test whether
the admission rate during the full moon is different from the average rate during
the rest of the month. The null subhypothesis corresponding to such a test
would be $H_0$: $\mu_2 = (\mu_1 + \mu_3)/2$.

Following the procedure outlined in Section 12.4, the contrast associated
with $H_0$ is

$$C = -\frac{1}{2}\mu_1 + \mu_2 - \frac{1}{2}\mu_3$$

and its estimate is

$$\hat{C} = -\frac{1}{2}(10.92) + 1(13.3) - \frac{1}{2}(11.46) = 2.11$$

From Definition 12.4.2, the sum of squares associated with $C$ is:

$$SS_C = \frac{(2.11)^2}{\dfrac{(-1/2)^2}{12} + \dfrac{(1)^2}{12} + \dfrac{(-1/2)^2}{12}} = 35.62$$

Dividing $SS_C$ by the mean square for error gives an $F$ ratio of 5.93 (with 1
and 22 degrees of freedom):

$$F = \frac{35.62/1}{132.08/22} = 5.93$$

For $\alpha = 0.05$, though, $F_{.95,1,22} = 4.30$. Therefore, in contrast to our earlier failure to reject
$H_0$: $\mu_1 = \mu_2 = \mu_3$, we would reject $H_0$: $\mu_2 = (\mu_1 + \mu_3)/2$ and conclude that the
Transylvania effect does exist.
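
The contrast calculation can be sketched in code as follows. The example uses the treatment means as they are rounded in the text (10.92, 13.3, and 11.46) so that the printed values match those above; the function name `contrast_f` and its arguments are illustrative assumptions, and the sum-of-squares formula is the one quoted from Definition 12.4.2 with every mean based on $b$ observations.

```python
def contrast_f(means, coeffs, b, sse, df_error):
    """Return (C-hat, SS_C, F) for a contrast among treatment means,
    each mean being based on b observations."""
    c_hat = sum(c * m for c, m in zip(coeffs, means))
    ss_c = c_hat ** 2 / sum(c ** 2 / b for c in coeffs)
    return c_hat, ss_c, (ss_c / 1) / (sse / df_error)


# Means as rounded in the text; SSE and its df come from Table 13.2.9.
c_hat, ss_c, f = contrast_f([10.92, 13.3, 11.46], [-0.5, 1.0, -0.5],
                            b=12, sse=132.08, df_error=22)
print(round(c_hat, 2), round(ss_c, 2), round(f, 2))
```

Because the coefficients sum to zero, the function can be reused for any other contrast among the three lunar phases by changing only `coeffs`.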


Comment It is always more than a little disconcerting when two statistical tech- 
niques applied to the same data lead to opposite conclusions. That such apparent 
contradictions occur, though, should not be unexpected. Different methods of analysis
simply utilize the data in different ways. Disagreements from time to time are
inevitable.


Questions 


13.2.1. In recent years a number of research projects
in extrasensory perception have examined the possibility
that hypnosis may be helpful in bringing out ESP
in persons who did not think they had any. The obvious
way to test such a hypothesis is with a self-paired
design: the ESP ability of a subject when he is awake is
compared to his ability when hypnotized. In one study
of this sort, fifteen college students were each asked to
guess the identity of 200 Zener cards (see Case Study
4.3.1). The same “sender” (that is, the person concentrating
on the card) was used for each trial. For 100 of the
trials both the student and the sender were awake; for the
other 100 both were hypnotized. If chance were the only
factor involved, the expected number of correct identifications
in each set of 100 trials would be 20. The observed
average numbers of correct guesses for subjects awake
and subjects hypnotized were 18.9 and 21.7, respectively
(21). Use the analysis of variance to determine whether
that difference is statistically significant at the 0.05
level.


Number of Correct Responses (out of 100) in ESP Experiment

           Sender and Student   Sender and Student
Student    in Waking State      in Hypnotic State
 1         18                   25
 2         19                   20
 3         16                   26
 4         21                   26
 5         16                   20
 6         20                   23
 7         20                   14
 8         14                   18
 9         11                   18
10         22                   20
11         19                   22
12         29                   27
13         16                   19
14         27                   27
15         15                   21
Ȳ          18.9                 21.7


13.2.2. The following table shows the audience shares of
the three major networks’ evening news broadcasts in four
major cities as reported by Arbitron. Test at the $\alpha = 0.10$
level of significance the null hypothesis that viewing levels
for news are the same for ABC, CBS, and NBC.


City   ABC    CBS    NBC
A      19.7   16.1   18.2
B      18.6   15.8   17.9
C      19.1   14.6   15.3
D      17.9   17.1   18.0


13.2.3. A paint manufacturer is experimenting with an
additive that might make the paint less chalky. To ensure
that the additive does not affect the tint, a quality-control
engineer takes a sample from each of seven batches
of Osage Orange. Each sample is split in half, and the
additive is put into one of the two. Both samples are
examined with a spectroscope, with the output read in
standardized lumen units. If the tint were exactly correct,
the reading would be 1.00. Test that the mean spectroscope
readings are the same for the two versions of Osage
Orange. Let $\alpha = 0.05$.


Batch   Without Additive   With Additive
1       1.10               1.06
2       1.05               1.02
3       1.08               1.17
4       0.98               1.21
5       1.01               1.01
6       0.96               1.23
7       1.02               1.19


13.2.4. The number of new building permits can be a 
good indicator of the strength of a region’s economic 
growth. The following table gives percentage increases 
over a four-year period for three geographical areas. Analyze
the data. Let $\alpha = 0.05$. What are your conclusions?


Year Eastern North Central Southwest 
2000 Pit: 0.1 0.9 
2001 1.3 0.8 1.0 
2002 2.9 1.1 1.4 


2003 3.5 1.3 1.5 


13.2.5. Analyze the Transylvania effect data in Case 
Study 13.2.3 by calculating 95% Tukey confidence inter- 
vals for the pairwise differences among the admission 
rates for the three different phases of the moon. How do 
your conclusions agree with (or differ from) those already 
discussed on p. 640? Let $Q_{.05,3,22} = 3.56$.


13.2.6. The table below gives a stock fund’s quarterly 
returns for the years 2003 to 2007. Are the results affected 
by the quarter of the year? Is the variability in the return 
from year to year statistically significant? State your conclusions
using the $\alpha = 0.05$ level of significance.


                     Quarter
Year   First    Second   Third    Fourth
2003   -5.29    8.62      5.23     6.44
2004    4.96    1.06     -0.25     6.32
2005    0.11    0.58      5.46     3.01
2006    5.30    0.82      4.81     6.54
2007    1.71    5.41     -1.92    -4.78


13.2.7. Find the 95% Tukey intervals for the data of 
Question 13.2.2, and use them to test the three pairwise 
comparisons of ABC, CBS, and NBC. 


13.2.8. A comparison was made of the efficiency of four 
different unit-dose injection systems. A group of pharma- 
cists and nurses were the “blocks.” For each system, they 
were to remove the unit from its outer package, assemble 
it, and simulate an injection. In addition to the standard 
system of using a disposable syringe and needle to draw 
the medication from a vial, the other systems tested were 
Vari-Ject (CIBA Pharmaceutical), Unimatic (Squibb), and
Tubex (Wyeth). Listed in the following table are the average
times (in seconds) needed to implement each of the
systems (149).


Average Times (sec) for Implementing Injection 
Systems 


Subject Standard Vari-Ject Unimatic Tubex \(T_{i.}\)
1 35.6 17.3 24.4 25.0 102.3
2 31.3 16.4 22.4 26.0 96.1
3 36.2 18.1 22.8 25.3 102.4
4 31.1 17.8 21.0 24.0 93.9
5 39.4 18.8 23.3 24.2 105.7
6 34.7 17.0 21.8 26.2 99.7
7 34.1 14.5 23.0 24.0 95.6
8 36.5 17.9 24.1 20.9 99.4
9 32.2 14.6 23.5 23.5 93.8
10 40.7 16.4 31.3 36.9 125.3
\(T_{.j}\) 351.8 168.8 237.6 256.0 1014.2


(a) Test the equality of the means at the 0.05 level. 
(b) Use Tukey’s method to test all six pairwise differences
of the four \(\mu_j\)’s. Let \(\alpha = 0.05\).


(Note: SSTOT = 2056.10, SSTR = 1709.60, and SSB =
193.53; \(Q_{.05,4,27} = 4.34\).)


13.2.9. Heart rates were monitored (10) for six tree 
shrews (Tupaia glis) during three different stages of sleep: 
LSWS (light slow-wave sleep), DSWS (deep slow-wave 
sleep), and REM (rapid-eye-movement sleep). 


Heart Rates (beats/5 seconds) 


Tree Shrew LSWS DSWS REM 


1 14.1 11.7 15.7 
2 26.0 21.1 21.5 
3 20.9 19.7 18.3 
4 19.0 18.2 17.0 
5 26.1 23.2 22.5 
6 20.5 20.7 18.9 


(a) Do the analysis of variance to test the equality of the 
heart rates during these three phases of sleep. Let 
\(\alpha = 0.05\).

(b) Because of the marked physiological difference 
between REM sleep and LSWS and DSWS sleep, it 
was decided before the data were collected to test 
the REM rate against the average of the other two. 
Test the appropriate subhypothesis with a contrast. 
Use the 0.05 level of significance. Also, find a sec- 
ond contrast orthogonal to the first and verify that 
the sum of the sum of squares for the two contrasts 
equals SSTR. 


13.2.10. Refer to the rat poison data of Case Study 13.2.2. 
Partition the treatment sum of squares into three orthogo- 
nal contrasts. Let one contrast test the hypothesis that the 
true acceptance percentage for the plain cornmeal is equal 
to the true acceptance percentage for the cornmeal with 
artificial roast beef flavoring. Let a second contrast com- 
pare the effectiveness of the “butter-vanilla” and “bread” 
baits. What does the third contrast test? Do all testing at 
the \(\alpha = 0.10\) level of significance.


13.2.11. Prove the computing formulas given in Equa- 
tions 13.2.2, 13.2.3, and 13.2.4. 


13.2.12. Differentiate the function

\[L = \sum_{i=1}^{b} \sum_{j=1}^{k} (y_{ij} - \beta_i - \mu_j)^2\]

with respect to all \(b + k\) parameters and calculate the least
squares estimates for the \(\beta_i\)’s and \(\mu_j\)’s.


13.2.13. True or false: 


(b) Either MSTR or MSB or both are greater than or 
equal to MSE. 


13.2.14. For a set of randomized block data comparing k 
treatments within b blocks, find 


(a) E(SSB) 
(b) E(SSE) 


13.3 The Paired t Test 


The randomized block design is typically used when three or more treatment levels 
are to be compared within a set of b blocks. If an experiment involves b blocks but 
only two treatment levels, the ANOVA described in Table 13.2.2 can still be used, 
but a computationally simpler (and equivalent) approach is to do a paired t test. 
The latter has the additional advantage in that it shows more clearly how the use of 
blocks can facilitate the comparison of treatments.

By definition, a “pair” represents a more or less constant set of conditions under
which one measurement can be made on Treatment X and one measurement on 
Treatment Y. Paired data, then, consist of measurements taken on Treatment X and 
Treatment Y within each of b pairs. In effect, the paired t test pools the treatment 
response differences within each pair from pair to pair. 

Recall Figure 8.2.4. The two observations recorded for the ith pair can be 
written 


\[X_i = \mu_X + B_i + \varepsilon_i\]

and

\[Y_i = \mu_Y + B_i + \varepsilon_i'\]

where


1. \(\mu_X\) and \(\mu_Y\) are the true means associated with Treatment X and Treatment Y,
respectively


and


2. \(B_i\) is the effect (that is, the numerical contribution to the measurement)
resulting from the conditions defining Pair i (\(B_i\) may be positive, negative, or
zero).


For the purposes of this section, it will also be assumed that \(\varepsilon_i\) and \(\varepsilon_i'\) are
independent, normally distributed random variables, each having mean zero, but with
variances \(\sigma_\varepsilon^2\) and \(\sigma_{\varepsilon'}^2\), respectively.

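Notice that \(B_i\) “disappears” when the two measurements within a pair are
subtracted:

\[D_i = \mu_X + B_i + \varepsilon_i - (\mu_Y + B_i + \varepsilon_i') = \mu_X - \mu_Y + \varepsilon_i - \varepsilon_i' \tag{13.3.1}\]

Moreover, it follows that


1. \(E(D_i) = \mu_D = \mu_X - \mu_Y\)
2. \(\mathrm{Var}(D_i) = \sigma_D^2 = \sigma_\varepsilon^2 + \sigma_{\varepsilon'}^2\)


and
3. \(D_i\) is normally distributed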

Equation 13.3.1 is the key to understanding how the paired-data design works.
Suppose an experimenter recognizes that taking a measurement on Treatment X
(under the conditions present in, say, Pair i) will result in a possibly sizeable \(B_i\)
being included in the observation. Since the actual magnitude of \(B_i\) is unknown, its
presence complicates the interpretation of what the observed measurement is telling
us about the effect of Treatment X. What the experimenter should do in that situation
is take a measurement on Treatment Y under the same conditions that prevailed
for the measurement on Treatment X. That measurement, then, will also include the
component \(B_i\), but if the two observations are subtracted, Equation 13.3.1 shows
that the resulting difference (1) will be free of \(B_i\) and (2) will be an estimate for
\(\mu_D = \mu_X - \mu_Y\). In effect, the paired-data design allows for the comparison of Treatment X
and Treatment Y to be made unencumbered by whatever differences might
exist in the experimental environment. (A more specific example of this important
idea will be described at length in Section 13.4.)
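
The cancellation described above can be checked numerically. The following is a minimal sketch, assuming invented treatment means and error terms (none of the numbers come from the text): the same errors are combined with two very different sets of pair effects, and the within-pair differences come out the same either way.

```python
# Hypothetical illustration: none of these numbers come from the text.
mu_x, mu_y = 10.0, 8.0                      # assumed treatment means
eps_x = [0.3, -0.2, 0.1, -0.4, 0.2]         # assumed errors for Treatment X
eps_y = [-0.1, 0.2, -0.3, 0.1, 0.1]         # assumed errors for Treatment Y


def differences(block_effects):
    """Return D_i = X_i - Y_i when pair i carries the effect B_i."""
    xs = [mu_x + b + e for b, e in zip(block_effects, eps_x)]
    ys = [mu_y + b + e for b, e in zip(block_effects, eps_y)]
    return [x - y for x, y in zip(xs, ys)]


no_pair_effects = differences([0.0, 0.0, 0.0, 0.0, 0.0])
large_pair_effects = differences([-50.0, 20.0, 5.0, 80.0, -15.0])

# The two sets of differences agree up to floating-point rounding,
# because B_i enters both X_i and Y_i and drops out of X_i - Y_i.
for d0, d1 in zip(no_pair_effects, large_pair_effects):
    assert abs(d0 - d1) < 1e-9

print([round(d, 2) for d in large_pair_effects])
```

Each printed difference equals \(\mu_X - \mu_Y + \varepsilon_i - \varepsilon_i'\), no matter how large the pair effects are.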

Since \(\mu_D = \mu_X - \mu_Y\), testing \(H_0\): \(\mu_D = 0\) is equivalent to testing \(H_0\): \(\mu_X = \mu_Y\). The
procedure for doing the former is known as a paired t test. The statistic for testing
\(H_0\): \(\mu_D = 0\) is a special case of Theorem 7.3.5. If \(D_i = X_i - Y_i\), \(i = 1, 2, \ldots, b\), is a set
of within-pair treatment differences, where \(\bar{D}\) and \(S_D\) denote the sample mean and
sample standard deviation of the \(D_i\)’s, respectively, then

\[\frac{\bar{D} - \mu_D}{S_D/\sqrt{b}}\]

will have a Student t distribution with \(b - 1\) degrees of freedom.


Theorem 13.3.1

Let \(d_1, d_2, \ldots, d_b\) be a random sample of within-pair treatment differences from a normal
distribution whose mean is \(\mu_D\). Let \(\bar{d}\) and \(s_D\) denote the sample mean and sample
standard deviation of the \(d_i\)’s, and define \(t = \bar{d}/(s_D/\sqrt{b})\).


a. To test \(H_0\): \(\mu_D = 0\) versus \(H_1\): \(\mu_D < 0\) at the \(\alpha\) level of significance, reject \(H_0\) if
\(t \le -t_{\alpha,b-1}\).

b. To test \(H_0\): \(\mu_D = 0\) versus \(H_1\): \(\mu_D > 0\) at the \(\alpha\) level of significance, reject \(H_0\) if
\(t \ge t_{\alpha,b-1}\).

c. To test \(H_0\): \(\mu_D = 0\) versus \(H_1\): \(\mu_D \ne 0\) at the \(\alpha\) level of significance, reject \(H_0\) if t is
either (1) \(\le -t_{\alpha/2,b-1}\) or (2) \(\ge t_{\alpha/2,b-1}\).
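
The three decision rules of the theorem can be written as a short routine. This is a sketch only: the function names `paired_t` and `reject_h0` are illustrative, the five differences are hypothetical, and the critical value is assumed to be looked up in a t table by the caller rather than computed.

```python
import math
import statistics


def paired_t(differences):
    """Return (t, df) for testing H0: mu_D = 0 from within-pair differences."""
    b = len(differences)
    d_bar = statistics.mean(differences)
    s_d = statistics.stdev(differences)      # sample standard deviation, divisor b - 1
    return d_bar / (s_d / math.sqrt(b)), b - 1


def reject_h0(t, critical_value, alternative):
    """Apply parts (a), (b), and (c) of the theorem.

    critical_value is t_{alpha, b-1} for the one-sided alternatives
    ("less", "greater") and t_{alpha/2, b-1} for "two-sided".
    """
    if alternative == "less":
        return t <= -critical_value
    if alternative == "greater":
        return t >= critical_value
    if alternative == "two-sided":
        return t <= -critical_value or t >= critical_value
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")


# Hypothetical differences for b = 5 pairs (not data from the text)
d = [1.2, 0.8, 1.5, 0.9, 1.1]
t, df = paired_t(d)

# Standard table values: t_{.025,4} = 2.7764 and t_{.05,4} = 2.1318
print(df, reject_h0(t, 2.7764, "two-sided"), reject_h0(t, 2.1318, "less"))
```

Passing the critical value in as an argument keeps the sketch free of any statistical library; a package that supplies t quantiles could replace the table lookup.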


Case Study 13.3.1 


Prior to the 1968 appearance of Kenneth Cooper’s book entitled Aerobics, the 
word did not appear in Webster’s Dictionary. Now the term is commonly under- 
stood to refer to sustained exercises intended to strengthen the heart and lungs. 
The actual benefits of such physical activities, as well as their possible detri- 
mental effects, have spawned a great deal of research in human physiology as it 
relates to exercise. 

One such study (73) concerned changes in the blood, specifically in 
hemoglobin levels before and after a prolonged brisk walk. Hemoglobin helps 
red blood cells transport oxygen to tissues and then remove carbon dioxide. 
Given the stress that exercise places on the need for that particular exchange, it 
is not unreasonable to suspect that aerobics might alter the blood’s hemoglobin 
levels. 

Ten athletes had their hemoglobin levels measured (in g/dl) prior to 
embarking on a sixty-kilometer walk. After they finished, their levels were 
measured again (see Table 13.3.1). Set up and test an appropriate \(H_0\) and \(H_1\).

If \(\mu_X\) and \(\mu_Y\) denote the true average hemoglobin levels before and after
walking, respectively, and if \(\mu_D = \mu_X - \mu_Y\), then the hypotheses to be tested are


\(H_0\): \(\mu_D = 0\)
versus
\(H_1\): \(\mu_D \ne 0\)


Let 0.05 be the level of significance. 
From Table 13.3.1, 


\[\sum_{i=1}^{10} d_i = 4.7 \quad \text{and} \quad \sum_{i=1}^{10} d_i^2 = 8.17\]


Table 13.3.1
Subject Before Walk, \(x_i\) After Walk, \(y_i\) \(d_i = x_i - y_i\)
A 14.6 13.8 0.8
B 17.3 15.4 1.9
C 10.9 11.3 -0.4
D 12.8 11.6 1.2
E 16.6 16.4 0.2
F 12.2 12.6 -0.4
G 11.2 11.8 -0.6
H 15.4 15.0 0.4
I 14.8 14.4 0.4
J 16.2 15.0 1.2

Therefore,

\[\bar{d} = \frac{1}{10}(4.7) = 0.47\]

and

\[s_D^2 = \frac{10(8.17) - (4.7)^2}{10(9)} = 0.662\]


Since n = 10, the critical values for the test statistic will be the 2.5th and
97.5th percentiles of the Student t distribution with 9 degrees of freedom:
\(\pm t_{.025,9} = \pm 2.2622\). The appropriate decision rule from Theorem 13.3.1,
then, is

Reject \(H_0\): \(\mu_D = 0\) if \(\dfrac{\bar{d}}{s_D/\sqrt{10}}\) is either \(\le -2.2622\) or \(\ge 2.2622\).

In this case the t ratio is

\[\frac{0.47}{\sqrt{0.662}/\sqrt{10}} = 1.827\]

and our conclusion is to fail to reject \(H_0\): The difference between \(\bar{d}\) (= 0.47) and
the \(H_0\) value for \(\mu_D\) (= 0) is not statistically significant.
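
The arithmetic in this case study can be retraced with a few lines of code. The sketch below uses the hemoglobin readings from Table 13.3.1 and the critical value 2.2622 quoted above; the variable names are illustrative. Because it carries \(s_D^2\) at full precision instead of rounding it to 0.662, the t ratio comes out near 1.826 rather than 1.827, which agrees with the case study to two decimal places.

```python
import math

# Hemoglobin levels (g/dl) from Table 13.3.1
before = [14.6, 17.3, 10.9, 12.8, 16.6, 12.2, 11.2, 15.4, 14.8, 16.2]
after = [13.8, 15.4, 11.3, 11.6, 16.4, 12.6, 11.8, 15.0, 14.4, 15.0]

d = [x - y for x, y in zip(before, after)]
b = len(d)
sum_d = sum(d)
sum_d2 = sum(v * v for v in d)

d_bar = sum_d / b
s2_d = (b * sum_d2 - sum_d ** 2) / (b * (b - 1))   # computing formula for s_D^2
t = d_bar / math.sqrt(s2_d / b)

critical = 2.2622                                   # t_{.025,9}, as given in the text
reject = abs(t) >= critical

print(round(sum_d, 2), round(sum_d2, 2), round(s2_d, 3), round(t, 2), reject)
```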


Case Study 13.3.2 


Some rental car agencies promise to offer lower-cost rentals. Among them are 
the aptly named Budget and Thrifty. But is Thrifty really thriftier? Table 13.3.2 
shows the rates charged by these two companies for a midsize sedan rented 
midweek with a month’s notice at each of eleven major airports. According to 
the \(d_i\)’s listed in the last column, \(\bar{d}\) = $6.29 (= average Budget rate - average
Thrifty rate). The parameter of interest here is \(\mu_D\), the true average difference
between the Budget and Thrifty rates. One question to be answered is whether
the sample mean of \(\bar{d}\) = $6.29 is sufficiently positive to overturn the presumption
that \(\mu_D = 0\).


Table 13.3.2

Airport Budget, \(x_i\) Thrifty, \(y_i\) \(d_i = x_i - y_i\)
Atlanta 93.74 88.54 5.20
Baltimore 129.75 125.00 4.75
Charlotte 100.33 99.03 1.30
Chicago 111.04 104.14 6.90
Dallas-Ft. Worth 167.15 162.08 5.07
Denver 149.56 141.41 8.15
Los Angeles 124.26 122.99 1.27
Miami 108.57 102.51 6.06
New Orleans 118.62 117.44 1.18
Seattle 129.81 121.76 8.05
St. Louis 98.58 77.32 21.26
Source: www.expedia.com

Notice first of all that these \(x_i\)’s and \(y_i\)’s are dependent: the $93.74 and
$88.54 in the first row, for example, are lower than most of the rates at other airports
and may reflect lower operating costs or less demand in Atlanta. That
is, included in the $93.74 and $88.54 is the \(B_1\) referenced in Equation 13.3.1.
Similarly, a portion of the $129.75 and $125.00 is the \(B_2\) for Baltimore; and
so on.

A confidence interval for \(\mu_D\) will provide an estimate of the savings associated
with renting Thrifty midsize sedans and also give us a way of testing the
two-sided hypothesis that \(\mu_D = 0\).

Theorem 7.4.1 applies to the \(d_i\)’s, so the form of the \(100(1 - \alpha)\%\) confidence
interval is

\[\left(\bar{d} - t_{\alpha/2,b-1}\frac{s_D}{\sqrt{b}},\ \bar{d} + t_{\alpha/2,b-1}\frac{s_D}{\sqrt{b}}\right)\]

The average of the figures in the last column is \(\bar{d}\) = $6.29 and the sample
standard deviation is \(s_D\) = $5.59. For \(\alpha = 0.05\), the Student t value is

\[t_{.025,b-1} = t_{.025,10} = 2.2281\]

so the 95% confidence interval reduces to

\[\left(6.29 - 2.2281\frac{5.59}{\sqrt{11}},\ 6.29 + 2.2281\frac{5.59}{\sqrt{11}}\right) = (\$2.53, \$10.05)\]

Moreover, since 0 is not in the confidence interval, we can reject the null
hypothesis that \(\mu_D = 0\).
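
The interval can be recomputed directly from the rates in Table 13.3.2. The following sketch takes the critical value 2.2281 from the text and uses illustrative variable names. Since it does not round \(\bar{d}\) and \(s_D\) before forming the interval, its endpoints may differ from the ones printed above by about a cent.

```python
import math
import statistics

# Rental rates (dollars) from Table 13.3.2
budget = [93.74, 129.75, 100.33, 111.04, 167.15, 149.56,
          124.26, 108.57, 118.62, 129.81, 98.58]
thrifty = [88.54, 125.00, 99.03, 104.14, 162.08, 141.41,
           122.99, 102.51, 117.44, 121.76, 77.32]

d = [x - y for x, y in zip(budget, thrifty)]
b = len(d)
d_bar = statistics.mean(d)
s_d = statistics.stdev(d)            # sample standard deviation of the differences
t_crit = 2.2281                      # t_{.025,10}, as given in the text

half_width = t_crit * s_d / math.sqrt(b)
lower, upper = d_bar - half_width, d_bar + half_width

# H0: mu_D = 0 is rejected at the 0.05 level when 0 lies outside the interval
zero_inside = lower <= 0 <= upper
print(round(d_bar, 2), round(s_d, 2), round(lower, 2), round(upper, 2), zero_inside)
```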


About the Data The difference in rental costs in St. Louis is clearly an “outlier” 
and possibly results from a Thrifty promotion of some kind. The distortion that such 
a deviant quantity introduces suggests that a better strategy would be to compare 
average rental costs over an extended period of time.


Criteria for Pairing


The Comment following Case Study 13.2.1 discusses the issues an experimenter 
needs to consider in deciding whether the comparison of k treatment levels should 
be done as a k-sample design or a randomized block design. When two treatment 
levels are to be compared, similar questions need to be addressed: 


1. Should the comparison be done with independent samples (and the two-sample 
t test) or with dependent samples (and the paired t test)?

2. If the paired-data model is the experimental design chosen, what criterion 
should be used to define the pairs? 


The pros and cons of using dependent samples will be discussed in Section 13.4. 
Here we want to focus on some of the ways pairs are defined in real-world appli- 
cations. A representative sampling of blocking criteria in general is reflected in the 
five Case Studies appearing earlier in Chapter 13. 

The ultimate pairing criterion is to use each subject twice. Then the exper- 
imenter can be confident that whatever “contribution” a subject makes to the 
numerical value for Treatment X is exactly the same as the contribution made to 
the numerical value for Treatment Y. Over the years, “before and after” studies of 
this sort have become very popular with researchers. The aerobics/hemoglobin data 
described in Case Study 13.3.1 are a typical example. 

Not every experimental protocol, though, lends itself to the possibility of test- 
ing both treatments on each subject. Suppose the objective of a study is to compare 
two methods of teaching fractions to third graders. Once a subject is exposed to 
one method (and learns something about fractions), assessing the effectiveness of 
the second treatment would be problematic. Clearly, such a study needs to be 
done with pairs of two (similar) subjects, one being taught with Method X, and 
the other with Method Y. Defining what “similar” means in this case could be 
done in a variety of ways. The closest approximation to the “before and after” for- 
mat would be to use twins as subjects. If the number of twins available, though, 
was insufficient, “similar” could be defined in terms of IQ scores or previous math 
grades. 

Another widely used strategy for creating dependent observations is to pair up 
measurements taken close together in time and/or space. The rationale, of course, is 
that measurements sharing a variety of environmental characteristics will be inflated 
(or deflated) by similar amounts for both Treatment X and Treatment Y. The data 
in Tables 13.2.5, 13.2.8, and 13.3.2 are all cases in point. 

Probably the most challenging scenarios faced by experimenters are situations 
where there are no obvious pairing criteria of the sort just described. Rather, 
some sort of “pre-test” needs to be derived that would serve as a mechanism 
for identifying subjects likely to respond in similar ways to the two treatments. 
Recall, for example, the blocks defined in Case Study 13.2.1. There, the Height 
Avoidance Test (HAT) was used as a way of categorizing the severity of a sub- 
ject’s initial level of acrophobia. By defining blocks to be subjects with similar 
HAT scores, a set of relatively homogeneous experimental environments were cre- 
ated (Blocks A through E) within which all the competing therapies could be 
compared. 

It would be difficult to overestimate the importance of choosing the block- 
ing and pairing criteria carefully whenever the randomized block and paired-data 
designs are being used. Whatever can be done to minimize the additional variation
in the measurements due to specific environmental effects will allow the treatments
to be compared with that much more precision.


The Equivalence of the Paired t Test and the Randomized Block 
ANOVA When k = 2 


Example 12.2.2 showed that the analysis of variance done on a set of k-sample data
when k = 2 is equivalent to a (pooled) two-sample t test of \(H_0\): \(\mu_X = \mu_Y\) against a
two-sided alternative. Although the numerical values of the observed t and observed F
will be different, as will be the locations of the two critical regions, the final inference
will necessarily be the same. A similar equivalence holds for the paired t test and the
randomized block ANOVA (when k = 2).

Recall Case Study 13.3.1. Analyzed with a paired t test, \(H_0\): \(\mu_D = 0\) should be
rejected in favor of \(H_1\): \(\mu_D \ne 0\) at the \(\alpha = 0.05\) level of significance if

\[t \le -t_{\alpha/2,b-1} = -t_{.025,9} = -2.2622 \quad \text{or if} \quad t \ge t_{\alpha/2,b-1} = 2.2622\]

But \(t = \dfrac{\bar{d}}{s_D/\sqrt{10}} = 1.83\), so the conclusion is “fail to reject \(H_0\).”

Table 13.3.3 shows the Minitab input and output for doing the analysis of variance
on those same observations. The observed F ratio for “Treatments” is 3.34, and the
corresponding \(\alpha = 0.05\) critical value is \(F_{.95,1,9} = 5.12\).


Table 13.3.3


MTB > set c1
DATA > 14.6 17.3 10.9 12.8 16.6 12.2 11.2 15.4 14.8 16.2
DATA > 13.8 15.4 11.3 11.6 16.4 12.6 11.8 15.0 14.4 15.0
DATA > end
MTB > set c2
DATA > 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
DATA > end
MTB > set c3
DATA > 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
DATA > end
MTB > name c1 'Hemglb' c2 'Activ' c3 'Blocks'
MTB > twoway c1 c2 c3


Two-way ANOVA: Hemglb versus Activ, Blocks


Source DF SS MS F P
Activ 1 1.1045 1.10450 3.34 0.101
Blocks 9 73.2405 8.13783 24.57 0.000
Error 9 2.9805 0.33117
Total 19 77.3255


Notice that (1) the observed F is the square of the observed t and (2) the F critical
value is the square of the t critical value:

\[3.34 = (1.827)^2 \quad \text{and} \quad 5.12 = (2.2622)^2\]

It follows, then, that the paired t test will reject the null hypothesis that \(\mu_D = 0\) if
and only if the randomized block ANOVA rejects the null hypothesis that the two
Treatment means (\(\mu_X\) and \(\mu_Y\)) are equal.
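
The identity between the observed F and the square of the observed t can be verified numerically on the hemoglobin data. This is a minimal sketch with illustrative variable names; it computes the randomized block sums of squares from their definitions (not by way of Minitab) and compares the resulting F ratio with the squared paired t ratio.

```python
import math
import statistics

# Hemoglobin levels (g/dl) from Table 13.3.1; k = 2 treatments, b = 10 blocks
before = [14.6, 17.3, 10.9, 12.8, 16.6, 12.2, 11.2, 15.4, 14.8, 16.2]
after = [13.8, 15.4, 11.3, 11.6, 16.4, 12.6, 11.8, 15.0, 14.4, 15.0]
b, k = len(before), 2

grand_mean = (sum(before) + sum(after)) / (b * k)
treatment_means = [sum(before) / b, sum(after) / b]
block_means = [(x + y) / k for x, y in zip(before, after)]

ss_treat = b * sum((m - grand_mean) ** 2 for m in treatment_means)
ss_block = k * sum((m - grand_mean) ** 2 for m in block_means)
ss_total = sum((v - grand_mean) ** 2 for v in before + after)
ss_error = ss_total - ss_treat - ss_block

# F ratio for treatments, with k - 1 and (b - 1)(k - 1) degrees of freedom
f_ratio = (ss_treat / (k - 1)) / (ss_error / ((b - 1) * (k - 1)))

# Paired t ratio computed from the within-pair differences
d = [x - y for x, y in zip(before, after)]
t = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(b))

print(round(ss_treat, 4), round(ss_block, 4), round(ss_error, 4))
print(round(f_ratio, 2), round(t ** 2, 2))
```

The agreement is exact: when k = 2, SSTR equals \(b\bar{d}^2/2\) and SSE equals \((b - 1)s_D^2/2\), so their ratio of mean squares reduces to \(b\bar{d}^2/s_D^2 = t^2\).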


Questions 


13.3.1. Case Study 7.5.2 compared the volatility of Global 
Rock Funds’ return on investments to that of the bench- 
mark Lipper fund. But can it be said that the returns 
themselves beat the benchmark? The table below gives 
the annual returns of the Global Rock Fund for the years 
1989 to 2007 and the corresponding Lipper averages. Test 
the hypothesis that \(\mu_D > 0\) for these data at the 0.05 level
of significance. 


Investment return % Investment return %
Year Global Rock, x Lipper Avg., y Year Global Rock, x Lipper Avg., y
1989 15.32 14.76 1999 27.43 34.44
1990 1.62 -1.91 2000 8.57 1.13
1991 28.43 20.67 2001 1.88 -3.24
1992 11.91 6.18 2002 -7.96 -8.11
1993 20.71 22.97 2003 35.98 32.57
1994 -2.15 -2.44 2004 14.27 15.37
1995 23.29 20.26 2005 10.33 11.25
1996 15.96 14.79 2006 15.94 12.70
1997 11.12 14.27 2007 16.71 9.65
1998 0.37 6.25


Note that \(\sum_{i=1}^{19} d_i = 28.17\) and \(\sum_{i=1}^{19} d_i^2 = 370.8197\).


13.3.2. Recall the depth perception data described in 
Question 8.2.6. Use a paired t test with \(\alpha = 0.05\) to
compare the numbers of trials needed to learn depth 
perception for Mothered and Unmothered lambs. 


13.3.3. Blood coagulates as a result of a complex 
sequence of chemical reactions. The protein thrombin trig- 
gers the clotting of blood under the influence of another 
protein called prothrombin. One measure of a person’s 
blood clotting ability is expressed in prothrombin time, 
which is defined to be the interval between the initiation of 
the thrombin-prothrombin reaction and the formation of 
the clot. One study (209) looked at the effect of aspirin on 
prothrombin time. The following table gives, for each of
twelve subjects, the prothrombin time (in seconds) before
and three hours after taking two aspirin tablets (650 mg).
Test the hypothesis that aspirin influences prothrombin
times. Perform the test at both the \(\alpha = 0.05\) and \(\alpha = 0.01\)
levels.


Subject Before Aspirin, x After Aspirin, y 
1 12.3 12.0 
2 12.0 12.3 
3 12.0 12.5 
4 13.0 12.0 
5 13.0 13.0 
6 12.5 12.5 
7 11.3 10.3 
8 11.8 11.3 
9 11.5 11.5
10 11.0 11.5
11 11.0 11.0 
12 11.3 11.5 


13.3.4. Use a paired t test to analyze the hypnosis/ESP
data given in Question 13.2.1. Let \(\alpha = 0.05\).


13.3.5. Perform the hypothesis test indicated in Question 
13.2.3 at the 0.05 level using a paired t test. Compare the
square of the observed t with the observed F. Do the same
for the critical values associated with the two procedures. 
What would you conclude? 


13.3.6. Let \(D_1, D_2, \ldots, D_b\) be the within-block differences
as defined in this section. Assume that the \(D_i\)’s are normal
with mean \(\mu_D\) and variance \(\sigma_D^2\) for \(i = 1, 2, \ldots, b\). Derive
a formula for a \(100(1 - \alpha)\%\) confidence interval for \(\mu_D\).
Apply this formula to the data of Case Study 13.3.1 and
construct a 95% confidence interval for the true average
hemoglobin difference (“before walk” minus “after walk”).


13.3.7. Construct a 95% confidence interval for \(\mu_D\) in the
prothrombin time data described in Question 13.3.3. See 
Question 13.3.6. 


13.3.8. Show that the paired t test is equivalent to the
F test in a randomized block design when the number of
treatment levels is two. (Hint: Consider the distribution of
\(T^2 = b\bar{D}^2/S_D^2\).)


13.4 Taking a Second Look at Statistics (Choosing 
between a Two-Sample t Test and a Paired t Test) 


Suppose that the means \(\mu_X\) and \(\mu_Y\) associated with two treatments X and Y are to
be compared. Theoretically, two “design” options are available:


1. test \(H_0\): \(\mu_X = \mu_Y\) with independent samples (using Theorem 9.2.2 or Theorem
9.2.3) or

2. test \(H_0\): \(\mu_D = 0\) with dependent samples (using Theorem 13.3.1).


Does it make a difference which design is used? Yes. Which one is better? That
depends on the nature of the subjects, and how likely they are to respond to the
treatments. Neither design is always superior to the other.

The two hypothetical examples described in this section illustrate the pros and
cons of each approach. In the first case, the paired-data model is clearly preferable;
in the second case, \(\mu_X\) and \(\mu_Y\) should be compared using a two-sample format.


Example 13.4.1

Comparing two weight loss plans

Suppose that Treatment X and Treatment Y are two diet regimens. A comparison 
of the two is to be done by looking at the weight losses recorded by subjects who 
have been using one of the two diets for a period of three months. Ten people have 
volunteered to be subjects. Table 13.4.1 gives the gender, age, height, and initial 
weight for each of the ten. 


Table 13.4.1 

Subject Gender Age Height Weight (in pounds)
HM M 65 5'8" 204
HW F 41 5'4" 165
JC M 23 6'0" 260
AF F 63 5'3" 207
DR F 59 5'2" 192
WT M 22 6'2" 253
SW F 19 5'1" 178
LT F 38 5'5" 170
TB M 62 5'7" 212
KS F 23 5'3" 195


Option A: Compare Diet X and Diet Y Using Independent Samples If the two- 
sample design is to be used, the first step would be to divide the ten subjects at 
random into two groups of size 5. Table 13.4.2 shows one such set of independent 
samples. 


Table 13.4.2 
Diet X Diet Y
HW (F, middle-aged, slightly overweight) JC (M, young, very overweight)
AF (F, elderly, very overweight) WT (M, young, very overweight)
SW (F, young, very overweight) HM (M, elderly, quite overweight)
TB (M, elderly, quite overweight) KS (F, young, very overweight)
DR (F, elderly, very overweight) LT (F, middle-aged, slightly overweight)


Notice that each of the two samples contains individuals who are likely to 
respond very differently to whichever diet they are on simply because of the huge 
disparities in their physical profiles. Included among the subjects representing Diet 
X, for example, are HW and TB; HW is a slightly overweight, middle-aged female, 
while TB is a quite overweight, elderly male. More than likely, their weight losses 
after three months will be considerably different.

If some of the subjects in Diet X lose relatively few pounds (which will probably
be the case for HW) while others record sizeable reductions (which is likely to
happen for AF, SW, and DR, all of whom are initially very overweight), the effect
will be to inflate the numerical value of \(s_X^2\). Similarly, the value of \(s_Y^2\) will be inflated
by the inherent differences among the subjects in Diet Y.

Now, recall the formula for the two-sample t statistic,

\[t = \frac{\bar{x} - \bar{y}}{s_p\sqrt{\frac{1}{n} + \frac{1}{m}}}\]

If \(s_X\) and \(s_Y\) are large, \(s_p\) will also be large. But if \(s_p\) (in the denominator of the t ratio)
is very large, the t statistic itself may be fairly small even if \(\bar{x} - \bar{y}\) is substantially
different from zero. That is, the considerable variation within the samples has the
potential to “obscure” the variation between the samples (as measured by \(\bar{x} - \bar{y}\)).
In effect, \(H_0\): \(\mu_X = \mu_Y\) might not be rejected (when it should be) only because the
variation from subject to subject is so large.


Option B: Compare Diet X and Diet Y Using Dependent Samples The same differences
from subject to subject that undermine the two-sample t test provide some
obvious criteria for setting up a paired t test. Table 13.4.3 shows a grouping into five
pairs of the ten subjects profiled in Table 13.4.2, where the two members of each pair
are as similar as possible with respect to the amount of weight they are likely to lose:
for example, Pair 2 (JC, WT) is comprised of two very overweight, young males.
In the terminology of Equation 13.3.1, the \(B_2\) that measures the subject effect of
persons fitting that description will be present in the weight losses reported by both
JC and WT. When their responses are subtracted, \(d_2 = x_2 - y_2\) will, in effect, be free
of the subject effect and will be a more precise estimate of the intrinsic difference
between the two diets. It follows that differences between the pairs, no matter how
sizeable those differences may be, are irrelevant because the comparisons of Diet
X and Diet Y (that is, the \(d_i\)’s) are made within the pairs, and then pooled from pair
to pair.


Table 13.4.3 
Pair Characteristics 
(HW, LT) Female, middle-aged, slightly overweight 
(JC, WT) Male, young, very overweight 
(SW, KS) Female, young, very overweight 
(HM, TB) Male, elderly, quite overweight 
(AF, DR) Female, elderly, very overweight 


The potential benefit here of using a paired-data design should be readily
apparent. Recall that the paired t statistic has the form

\[t = \frac{\bar{d}}{s_D/\sqrt{b}} = \frac{\bar{x} - \bar{y}}{s_D/\sqrt{b}} \tag{13.4.1}\]

For the reasons just cited, \(s_D/\sqrt{5}\) is likely to be much smaller than the two-sample
\(s_p\sqrt{\frac{1}{5} + \frac{1}{5}}\), thus reducing the likelihood that the paired t test’s denominator will
“wash out” its numerator.
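
The contrast between the two denominators can be made concrete with a small numerical sketch. The weight losses below are invented for illustration and do not come from the text; they are chosen so that subjects differ a great deal from pair to pair but only slightly within a pair, which is the situation the diet example describes.

```python
import math
import statistics

# Hypothetical three-month weight losses (pounds) for five matched pairs;
# these numbers are invented for illustration and do not come from the text.
x = [6.0, 25.0, 18.0, 14.0, 21.0]   # Diet X member of each pair
y = [4.5, 22.0, 16.5, 12.0, 19.5]   # Diet Y member of each pair
n = m = b = len(x)

diff = statistics.mean(x) - statistics.mean(y)

# Two-sample (pooled) denominator: s_p * sqrt(1/n + 1/m)
s2_p = ((n - 1) * statistics.variance(x)
        + (m - 1) * statistics.variance(y)) / (n + m - 2)
two_sample_denominator = math.sqrt(s2_p) * math.sqrt(1 / n + 1 / m)

# Paired denominator: s_D / sqrt(b)
d = [xi - yi for xi, yi in zip(x, y)]
paired_denominator = statistics.stdev(d) / math.sqrt(b)

t_two_sample = diff / two_sample_denominator
t_paired = diff / paired_denominator

print(round(two_sample_denominator, 2), round(paired_denominator, 2))
print(round(t_two_sample, 2), round(t_paired, 2))
```

With these assumed numbers the numerator is the same for both statistics, yet the two-sample ratio falls well inside its 0.05 critical values (\(\pm 2.3060\) with 8 df) while the paired ratio exceeds its own (\(\pm 2.7764\) with 4 df).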
Example 13.4.2

Comparing two eye surgery techniques

Suppose the ten subjects profiled in Table 13.4.2 are all nearsighted and have volunteered
to participate in a clinical trial comparing two laser surgery techniques. The
basic plan is to use Surgery X on five of the subjects, and Surgery Y on the other
five. A month later, each participant will be asked to rate (on a scale of 0 to 100) his
or her satisfaction with the operation.


Option A: Compare Surgery X and Surgery Y Using Independent Samples 
Unlike the situation encountered in the diet study, none of the information recorded 
on the volunteers (gender, age, height, weight) has any bearing on the measurements 
to be recorded here: a very overweight, young male is no more or no less likely to 
be satisfied with corrective eye surgery than is a slightly overweight, middle-aged 
female. That being the case, there is no way to group the ten subjects into five pairs 
in such a way that the two members of a pair are uniquely similar in terms of how 
they are likely to respond to the satisfaction question. 

To compare Surgery X and Surgery Y, then, using the two-sample format, we 
would simply divide the ten subjects—at random—into two groups of size 5, and 
choose between \(H_0\): \(\mu_X = \mu_Y\) and \(H_1\): \(\mu_X \ne \mu_Y\) on the basis of the two-sample t
statistic, which would have 8 (\(= n + m - 2 = 5 + 5 - 2\)) degrees of freedom.


Option B: Compare Surgery X and Surgery Y Using Dependent Samples Given 
the absence of any objective criteria for linking one subject with another in any 
meaningful way, the pairs would have to be formed at random. Doing that would 
have some serious negative consequences that would definitely argue against using 
the paired-data format. Suppose, for example, HW was paired with LT, as was the 
case in the diet study. Since the information in Table 13.4.2 has nothing to do with 
a person’s reaction to eye surgery, subtracting LT’s response from HW’s response 
would not eliminate the “subject” effect as it did in the diet study, because the 
“personal” contribution of LT to the observed x could be entirely different than 
the “personal” contribution of HW to the observed y. In general, the within-pair
differences, \(d_i = x_i - y_i\), \(i = 1, 2, \ldots, 5\), would still reflect the subject effects, so the
value of \(s_D\) would not be reduced (relative to \(s_p\)) as it was in the diet study.

Is a lack of reduction in the magnitude of \(s_D\) a serious problem? Yes, because
the paired-data format intentionally sacrifices degrees of freedom for the express
purpose of reducing \(s_D\). If the latter does not occur, those degrees of freedom are
wasted. Here, given a total of ten subjects, a two-sample t test would have 8 degrees
of freedom (\(= n + m - 2 = 5 + 5 - 2\)); a paired t test would have 4 degrees of freedom
(\(= b - 1 = 5 - 1\)). When a t test has fewer degrees of freedom, the critical values for
a given level of significance move farther away from zero, which means that the test
with the smaller number of degrees of freedom will have a greater probability of
committing a Type II error.

Table 13.4.4 shows a comparison of the two-sided critical values for t ratios with
4 degrees of freedom and with 8 degrees of freedom for \(\alpha\) equal to either 0.10, 0.05,
or 0.01. Clearly, the same value of \(\bar{x} - \bar{y}\) that would reject \(H_0\): \(\mu_X = \mu_Y\) with a t test
having 8 df may not be large enough to reject \(H_0\): \(\mu_D = 0\) with a t test having 4 df.


Table 13.4.4

\(\alpha\) \(t_{\alpha/2,4}\) \(t_{\alpha/2,8}\)
0.10 2.1318 1.8595
0.05 2.7764 2.3060
0.01 4.6041 3.3554
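
The cost of the lost degrees of freedom can be illustrated by holding the t ratio fixed and changing only the degrees of freedom. The sketch below assumes a hypothetical observed ratio of 2.5 under both designs, which corresponds to the case where pairing has failed to shrink the denominator; the critical values are standard t-table entries, and the function name `rejects` is illustrative.

```python
# Two-sided critical values t_{alpha/2, df} (standard t-table entries)
critical = {
    0.10: {4: 2.1318, 8: 1.8595},
    0.05: {4: 2.7764, 8: 2.3060},
    0.01: {4: 4.6041, 8: 3.3554},
}


def rejects(t_ratio, alpha, df):
    """Two-sided decision: reject H0 when |t| >= t_{alpha/2, df}."""
    return abs(t_ratio) >= critical[alpha][df]


t_ratio = 2.5   # hypothetical observed ratio, assumed equal under both designs
for alpha in critical:
    # Columns: alpha, decision with 8 df (two-sample), decision with 4 df (paired)
    print(alpha, rejects(t_ratio, alpha, 8), rejects(t_ratio, alpha, 4))
```

At \(\alpha = 0.05\) the assumed ratio rejects \(H_0\) with 8 df but not with 4 df.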


Appendix 13.A.1 Minitab Applications 


To produce the information in a randomized block ANOVA table, Minitab uses the
command TWOWAY C1 C2 C3. First, the data are “stacked,” treatment level over
treatment level, into a single column, say, c1 (similar to the way the \(y_{ij}\)’s in a Tukey
analysis are entered). Then two auxiliary columns must be created. The first, call it
c2, gives the column number for each entry in c1. The second, say, c3, gives the
block number (i.e., the row number) for each entry in c1.

Consider, again, the data in Case Study 13.2.1. Figure 13.A.1.1 is the Minitab
syntax for outputting the calculations that appear in Table 13.2.4. Notice that the
Windows version reverses columns C1 and C2.


Figure 13.A.1.1


MTB > set c1
DATA > 8 11 9 16 24 2 1 12 11 19 -2 0 6 2 11
DATA > end
MTB > set c2
DATA > 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
DATA > end
MTB > set c3
DATA > 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
DATA > end
MTB > name c1 'HAT' c2 'Therapy' c3 'Blocks'
MTB > twoway c1 c2 c3


Two-way ANOVA: HAT versus Therapy, Blocks


Analysis of variance for HAT


Source DF SS MS F P
Therapy 2 260.93 130.47 15.26 0.002
Blocks 4 438.00 109.50 12.81 0.001
Error 8 68.40 8.55
Total 14 767.33

Doing a Randomized Block Analysis of Variance Using Minitab Windows 


1. Enter the entire data set in column C1, beginning with Treatment level 1,
followed by Treatment level 2, and so on.

2. In column C2, enter the block number of each data point in C1; in column C3,
enter the column number of each data point in C1.

3. Click on STAT, then on ANOVA, then on TWO-WAY.

4. Type C1 in RESPONSE box, C2 in ROW FACTOR box, and C3 in COLUMN
FACTOR box.

5. Click on OK. 


There is no special command in Minitab for doing a paired t test, but none
is necessary. The appropriate P-value can be found by simply applying the (one-sample)
MTB > ttest command to the within-pair differences (and setting \(\mu_0\)
equal to 0). Figure 13.A.1.2 shows the syntax for doing the paired t test on the
aerobics data described in Case Study 13.3.1.


Figure 13.A.1.2 

MTB > set c1 
DATA > 14.6 17.3 10.9 12.8 16.6 12.2 11.2 15.4 14.8 16.2 
DATA > end 
MTB > set c2 
DATA > 13.8 15.4 11.3 11.6 16.4 12.6 11.8 15.0 14.4 15.0 
DATA > end 
MTB > let c3 = c1 - c2 
MTB > name c3 'di' 
MTB > ttest 0 c3 


One-Sample T: di 


Test of mu = 0 vs not = 0 

Variable   N   Mean  StDev  SE Mean           95% CI     T      P 
di        10  0.470  0.814    0.257  (-0.112, 1.052)  1.83  0.101 
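
The arithmetic behind that one-sample output is short enough to reproduce by hand. The sketch below is a hedged illustration in Python rather than Minitab; the variable names `c1`, `c2`, and `d` simply mirror the worksheet columns, and the P-value is not computed because the Student t distribution is not part of the Python standard library.

```python
import math
import statistics

# Columns c1 and c2 as entered in the Minitab session.
c1 = [14.6, 17.3, 10.9, 12.8, 16.6, 12.2, 11.2, 15.4, 14.8, 16.2]
c2 = [13.8, 15.4, 11.3, 11.6, 16.4, 12.6, 11.8, 15.0, 14.4, 15.0]

# Within-pair differences, the quantity handed to the one-sample t test.
d = [x - y for x, y in zip(c1, c2)]
n = len(d)

d_bar = statistics.mean(d)          # sample mean of the differences
s_d = statistics.stdev(d)           # sample standard deviation (divisor n - 1)
se_mean = s_d / math.sqrt(n)        # standard error of the mean difference
t_ratio = (d_bar - 0) / se_mean     # test statistic for H0: mu_D = 0

print(f"mean = {d_bar:.3f}, stdev = {s_d:.3f}, se = {se_mean:.3f}, t = {t_ratio:.2f}")
```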
[Figure: power functions of the one-sample t test and the sign test; vertical axes labeled Power] 


Critical to the justification of replacing a parametric test with a nonparametric test is 
a comparison of the power functions for the two procedures. The figures above 
illustrate the types of information that researchers have compiled: shown are the 
power functions of the one-sample t test (solid line) and the sign test (dashed lines) 
for three different sets of hypotheses, various degrees of nonnormality, a sample size 
of 10, and a level of significance of 0.05. (The parameter $\rho_n$ measures the shift from 
$H_0$ to $H_1$; $\kappa_3$ and $\kappa_4$ measure the extent of nonnormality in the sampled population.) 


14.1 Introduction 


Behind every confidence interval and hypothesis test we have studied thus far have 
been very specific assumptions about the nature of the pdf that the data presumably 
represent. For instance, the usual Z test for proportions, $H_0\colon p_X = p_Y$ versus 
$H_1\colon p_X \neq p_Y$, is predicated on the assumption that the two samples consist of 
independent and identically distributed Bernoulli random variables. The most common 
assumption in data analysis, of course, is that each set of observations is a random 
sample from a normal distribution. This was the condition specified in every t test 
and F test that we have done. 

The need to make such assumptions raises an obvious question: What changes 
when these assumptions are not satisfied? Certainly the statistic being calculated 
stays the same, as do the critical values that define the rejection region. What does 
change, of course, is the sampling distribution of the test statistic. As a result, the 
actual probability of committing, say, a Type I error will not necessarily equal the 
nominal probability of committing a Type I error. That is, if W is the test statistic 
with pdf $f_W(w \mid H_0)$ when $H_0$ is true, and C is the critical region, 


    $$\text{“true” } \alpha = \int_C f_W(w \mid H_0)\, dw$$ 


is not necessarily equal to the “nominal” $\alpha$, because $f_W(w \mid H_0)$ is different (because 
of the violated assumptions) from the presumed sampling distribution of the 
test statistic. Moreover, there is usually no way to know the “true” functional 
form of $f_W(w \mid H_0)$ when the underlying assumptions about the data have not 
been met. 

Statisticians have sought to overcome the problem implicit in not knowing the 
true $f_W(w \mid H_0)$ in two very different ways. One approach is the idea of robustness, 
a concept that was introduced in Section 7.4. The Monte Carlo simulations 
illustrated in Figure 7.4.6, for example, show that even though a set of $Y_i$’s deviates from 
normality, the distribution of the t ratio, 


    $$t = \frac{\bar{Y} - \mu_0}{s/\sqrt{n}}$$ 


is likely to be sufficiently close to $f_{T_{n-1}}(t)$ that the true $\alpha$, for all practical purposes, 
is about the same as the nominal $\alpha$. The one-sample t test, in other words, is often 
not seriously compromised when normality fails to hold. 
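
The gap between the true and nominal $\alpha$ can be explored numerically. The following Python sketch is only an illustration of the idea and is not a reproduction of Figure 7.4.6: the function name `estimated_true_alpha`, the choice of a uniform and an exponential population, the sample size of 10, the seed, and the number of replications are all assumptions made here. The cutoff 2.2622 is the tabled upper 0.025 point of the Student t distribution with 9 degrees of freedom.

```python
import math
import random

T_CRIT_9DF = 2.2622   # upper 0.025 cutoff, Student t with 9 df (standard table value)


def estimated_true_alpha(sampler, mu0, n=10, reps=20000, seed=1):
    """Fraction of simulated samples, drawn with H0 true, for which |t| >= the nominal cutoff."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        y = [sampler(rng) for _ in range(n)]
        y_bar = sum(y) / n
        s2 = sum((v - y_bar) ** 2 for v in y) / (n - 1)
        t = (y_bar - mu0) / math.sqrt(s2 / n)
        if abs(t) >= T_CRIT_9DF:
            rejections += 1
    return rejections / reps


# A symmetric nonnormal population (uniform on (0, 1), mean 0.5) ...
alpha_uniform = estimated_true_alpha(lambda rng: rng.random(), mu0=0.5)
# ... and a sharply skewed one (exponential with mean 1).
alpha_exponential = estimated_true_alpha(lambda rng: rng.expovariate(1.0), mu0=1.0)

print(f"nominal alpha = 0.05, uniform: {alpha_uniform:.3f}, exponential: {alpha_exponential:.3f}")
```

Each estimate is the proportion of samples that fall in the nominal critical region when $H_0$ is true, so comparing it with 0.05 gives a rough sense of how much a particular departure from normality matters for this sample size.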

A second way of dealing with the additional uncertainty introduced by violated 
assumptions is to use test statistics whose pdfs remain the same regardless of how the 
population sampled may change. Inference procedures having this sort of latitude 
are said to be nonparametric or, more appropriately, distribution free. 

The number of nonparametric procedures proposed since the early 1940s has 
been enormous and continues to grow. It is not the intention of Chapter 14 to survey 
this multiplicity of techniques in any comprehensive fashion. Instead, the objec- 
tive here is to introduce some of the basic methodology of nonparametric statistics 
in the context of problems whose “parametric” solutions have already been dis- 
cussed. Included in that list will be nonparametric treatments of the paired-data 
problem, the one-sample location problem, and both of the analysis of variance 
models covered in Chapters 12 and 13. 


14.2 The Sign Test 


Probably the simplest (and most general) of all nonparametric procedures is the 
sign test. Among its many applications, testing the null hypothesis that the median 
of a distribution is equal to some specific value is perhaps its most important. 

By definition, the median, $\tilde{\mu}$, of a continuous pdf $f_Y(y)$ is the value for which 
$P(Y < \tilde{\mu}) = P(Y > \tilde{\mu}) = \frac{1}{2}$. Suppose a random sample of size n is taken from $f_Y(y)$. 
If the null hypothesis $H_0\colon \tilde{\mu} = \tilde{\mu}_0$ is true, the number of observations, X, exceeding 
$\tilde{\mu}_0$ is a binomial random variable with $p = P(Y_i > \tilde{\mu}_0) = \frac{1}{2}$. Moreover, $E(X) = n/2$, 
$\text{Var}(X) = n \cdot \frac{1}{2} \cdot \frac{1}{2} = n/4$, and $\frac{X - n/2}{\sqrt{n/4}}$ would have approximately a standard normal 
distribution (by virtue of the DeMoivre-Laplace theorem), provided n is sufficiently 
large. Intuitively, values of X too much larger or too much smaller than n/2 would 
be evidence that $\tilde{\mu} \neq \tilde{\mu}_0$. 


Theorem 14.2.1 
Let $y_1, y_2, \ldots, y_n$ be a random sample of size n from any continuous distribution 
having median $\tilde{\mu}$, where $n \geq 10$. Let k denote the number of $y_i$’s greater than $\tilde{\mu}_0$, and 
let $z = \frac{k - n/2}{\sqrt{n/4}}$. 

a. To test $H_0\colon \tilde{\mu} = \tilde{\mu}_0$ versus $H_1\colon \tilde{\mu} > \tilde{\mu}_0$ at the $\alpha$ level of significance, reject $H_0$ if 
$z \geq z_\alpha$. 

b. To test $H_0\colon \tilde{\mu} = \tilde{\mu}_0$ versus $H_1\colon \tilde{\mu} < \tilde{\mu}_0$ at the $\alpha$ level of significance, reject $H_0$ if 
$z \leq -z_\alpha$. 

c. To test $H_0\colon \tilde{\mu} = \tilde{\mu}_0$ versus $H_1\colon \tilde{\mu} \neq \tilde{\mu}_0$ at the $\alpha$ level of significance, reject $H_0$ if z is 
either (1) $\leq -z_{\alpha/2}$ or (2) $\geq z_{\alpha/2}$. 


Comment Sign tests are designed to draw inferences about medians. If the underlying 
pdf being sampled, though, is symmetric, the median is the same as the mean, 
so concluding that $\tilde{\mu} \neq \tilde{\mu}_0$ is equivalent to concluding that $\mu \neq \mu_0$. 


Case Study 14.2.1 


Synovial fluid is the clear, viscid secretion that lubricates joints and tendons. 
Researchers have found that certain ailments can be diagnosed on the basis of 
a person’s synovial fluid hydrogen-ion concentration (pH). In healthy adults, 
the median pH for synovial fluid is 7.39. Listed in Table 14.2.1 are the pH values 
measured from fluids drawn from the knees of forty-three patients with arthritis 
(181). Does it follow from these data that synovial fluid pH can be useful in 
diagnosing arthritis? 

Let $\tilde{\mu}$ denote the median synovial fluid pH for adults suffering from 
arthritis. Testing 


        $H_0\colon \tilde{\mu} = 7.39$ 
            versus 
        $H_1\colon \tilde{\mu} \neq 7.39$ 


then becomes a way of quantifying the potential usefulness of synovial fluid pH 
as a way of diagnosing arthritis. 

By inspection, a total of k = 4 of the n = 43 $y_i$’s exceed $\tilde{\mu}_0 = 7.39$. Let $\alpha = 0.01$. 
The test statistic is 


    $$z = \frac{4 - 43/2}{\sqrt{43/4}} = -5.34$$ 


Table 14.2.1 

Subject   Synovial Fluid pH     Subject   Synovial Fluid pH 

HW        7.02                  BG        7.34 
AD        7.35                  GL        7.22 
TK        7.32                  BP        7.32 
EP        7.33                  NK        7.40 
AF        7.15                  LL        6.99 
LW        7.26                  KC        7.10 
LT        7.25                  FA        7.30 
DR        7.35                  ML        7.21 
VU        7.38                  CK        7.33 
SP        7.20                  LW        7.28 
MM        7.31                  ES        7.35 
DF        7.24                  DD        7.24 
LM        7.34                  SL        7.36 
AW        7.32                  RM        7.09 
BB        7.34                  AL        7.32 
TL        7.14                  BV        6.95 
PM        7.20                  WR        7.35 
JG        7.41                  HT        7.36 
DH        7.77                  ND        6.60 
ER        7.12                  SJ        7.29 
DP        7.45                  BA        7.31 
FF        7.28 


which lies well past the left-tail critical value ($= -z_{\alpha/2} = -z_{0.005} = -2.58$). It 
follows that $H_0\colon \tilde{\mu} = 7.39$ should be rejected, a conclusion suggesting that arthritis 
should be added to the list of ailments that can be detected by the pH of a 
person’s synovial fluid. 
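
The large-sample sign test in this case study reduces to a count and one standardization. The following Python sketch shows that calculation under stated assumptions: the function name `sign_test_z` is an illustrative choice, the forty-three pH values are read from Table 14.2.1 column by column, and the entry for subject JG is taken to be 7.41, a reading that is consistent with the reported count k = 4.

```python
import math
from statistics import NormalDist

ph = [
    7.02, 7.35, 7.32, 7.33, 7.15, 7.26, 7.25, 7.35, 7.38, 7.20, 7.31,
    7.24, 7.34, 7.32, 7.34, 7.14, 7.20, 7.41, 7.77, 7.12, 7.45, 7.28,
    7.34, 7.22, 7.32, 7.40, 6.99, 7.10, 7.30, 7.21, 7.33, 7.28, 7.35,
    7.24, 7.36, 7.09, 7.32, 6.95, 7.35, 7.36, 6.60, 7.29, 7.31,
]


def sign_test_z(values, median0):
    """Return (k, n, z): the count above median0, the sample size, and the approximate Z ratio."""
    n = len(values)
    k = sum(1 for v in values if v > median0)
    z = (k - n / 2) / math.sqrt(n / 4)
    return k, n, z


k, n, z = sign_test_z(ph, 7.39)

alpha = 0.01
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided cutoff, about 2.58
reject = abs(z) >= z_crit

print(f"k = {k}, n = {n}, z = {z:.2f}, cutoff = {z_crit:.2f}, reject H0: {reject}")
```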


A Small-Sample Sign Test 


If n < 10, the decision rules given in Theorem 14.2.1 for testing $H_0\colon \tilde{\mu} = \tilde{\mu}_0$ are 
inappropriate because the normal approximation is not entirely adequate. Instead, 
decision rules need to be determined using the exact binomial distribution. 


Case Study 14.2.2 


Instant coffee can be formulated several different ways—freeze-drying and 
spray-drying being two of the most common. From a health standpoint, the 
most important difference from method to method is the amount of caffeine 
that is left as a residue. It has been shown that the median amount of caffeine 
left by the freeze-drying method is 3.55 grams per 100 grams of dry matter. 


Listed in Table 14.2.2 are the caffeine residues recorded for eight brands of 
coffee produced by the spray-dried method (182). 


Table 14.2.2 

Brand   Caffeine Residue (gms/100 gms dry weight) 
A       4.8 
B       4.0 
C       3.8 
D       4.3 
E       3.9 
F       4.6 
G       3.1 
H       3.7 


If $\tilde{\mu}$ denotes the median caffeine residue characteristic of the spray-dried 
method, we compare the two methods by testing 


        $H_0\colon \tilde{\mu} = 3.55$ 
            versus 
        $H_1\colon \tilde{\mu} \neq 3.55$ 


By inspection, k = 7 of the n = 8 spray-dried brands left caffeine residues 
in excess of $\tilde{\mu}_0 = 3.55$. Given the discrete nature of the binomial distribution, 
simple decision rules yielding specific $\alpha$ values are not likely to exist, so 
small-sample tests of this sort are best couched in terms of P-values. Figure 14.2.1 
shows Minitab’s printout of the binomial pdf when n = 8 and $p = \frac{1}{2}$. Since $H_1$ 
here is two-sided, the P-value associated with k = 7 is the probability that the 
corresponding binomial random variable would be greater than or equal to 7 
plus the probability that it would be less than or equal to 1. That is, 


    P-value $= P(X \geq 7) + P(X \leq 1)$ 
            $= P(X = 7) + P(X = 8) + P(X = 0) + P(X = 1)$ 
            $= 0.031250 + 0.003906 + 0.003906 + 0.031250$ 
            $= 0.070$ 


The null hypothesis, then, can be rejected for any $\alpha \geq 0.07$. 


MTB > pdf; 
SUBC > binomial 8 0.5. 


Probability Density Function 
Binomial with n = 8 and p = 0.5 

x   P(X = x) 
0   0.003906 
1   0.031250 
2   0.109375 
3   0.218750 
4   0.273438 
5   0.218750 
6   0.109375 
7   0.031250 
8   0.003906 


Figure 14.2.1 
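
The same binomial probabilities, and the two-sided P-value built from them, can be generated without Minitab. What follows is a small Python sketch; the function name `binomial_pdf` and the variable names are illustrative, and the two-sided P-value is formed exactly as in the case study, by adding the upper tail at k to the mirror-image lower tail.

```python
from math import comb


def binomial_pdf(n, p):
    """P(X = x) for x = 0, 1, ..., n when X is binomial with parameters n and p."""
    return [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]


n, k = 8, 7
pdf = binomial_pdf(n, 0.5)

upper_tail = sum(pdf[k:])             # P(X >= 7)
lower_tail = sum(pdf[: n - k + 1])    # P(X <= 1), the mirror image of the upper tail
p_value = upper_tail + lower_tail

for x, prob in enumerate(pdf):
    print(f"{x}   {prob:.6f}")
print(f"two-sided P-value = {p_value:.3f}")
```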
Using the Sign Test for Paired Data 


Suppose a set of paired data, $(x_1, y_1), (x_2, y_2), \ldots, (x_b, y_b)$, has been collected, and 
the within-pair differences, $d_i = x_i - y_i$, $i = 1, 2, \ldots, b$, have been calculated (recall 
Theorem 13.3.1). The sign test becomes a viable alternative to the paired t test if 
there is reason to believe that the $d_i$’s do not represent a random sample from a 
normal distribution. Let 


    $$p = P(X_i > Y_i), \quad i = 1, 2, \ldots, b$$ 


The null hypothesis that the $x_i$’s and $y_i$’s are representing distributions with the same 
median is equivalent to the null hypothesis $H_0\colon p = \frac{1}{2}$. 

In the analysis of paired data, the generality of the sign test becomes especially 
apparent. The distribution of $X_i$ need not be the same as the distribution of $Y_i$, nor 
do the distributions of $X_i$ and $X_j$ or $Y_i$ and $Y_j$ need to be the same. Furthermore, 
none of the distributions has to be symmetric, and they could all have different 
variances. The only underlying assumption is that X and Y have continuous pdfs. The 
null hypothesis, of course, adds the restriction that the median of the distributions 
within each pair be equal. 

Let U denote the number of $(x_i, y_i)$ pairs for which $d_i = x_i - y_i > 0$. The statistic 
appropriate for testing $H_0\colon p = \frac{1}{2}$ is either an approximate Z ratio, $\frac{U - b/2}{\sqrt{b/4}}$, or the value 
of U itself, which has a binomial distribution with parameters b and $\frac{1}{2}$ (when the null 
hypothesis is true). As before, the normal approximation is adequate if $b \geq 10$. 


Case Study 14.2.3 


One reason frequently cited for the mental deterioration often seen in the 
very elderly is the reduction in cerebral blood flow that accompanies the aging 
process. Addressing that concern, a study was done (5) in a nursing home to 
see whether cyclandelate, a drug that widens blood vessels, might be able to 
stimulate cerebral circulation and retard the onset of dementia. 

The drug was given to eleven subjects on a daily basis. To measure its phys- 
iological effect, radioactive tracers were used to determine each subject’s mean 
circulation time (MCT) at the start of the experiment and four months later, 
when the regimen was discontinued. [The MCT is the length of time (in sec) it 
takes blood to travel from the carotid artery to the jugular vein.] Table 14.2.3 
summarizes the results. 

If cyclandelate has no effect on cerebral circulation, $p = P(X_i > Y_i) = \frac{1}{2}$. 
Moreover, it seems reasonable here to discount the possibility that the drug 
might be harmful, which means that a one-sided alternative is warranted. To be 
tested, then, is 


        $H_0\colon p = \frac{1}{2}$ 
            versus 
        $H_1\colon p > \frac{1}{2}$ 

where $H_1$ is one-sided to the right because increased cerebral circulation would 
result in the MCT being reduced, which would produce more patients for whom 
$x_i$ was larger than $y_i$. 


Table 14.2.3 

Subject   Before, $x_i$   After, $y_i$   $x_i > y_i$? 
J.B.      15              13             yes 
M.B.      12              8              yes 
A.B.      12              12.5           no 
M.B.      14              12             yes 
J.L.      13              12             yes 
S.M.      13              12.5           yes 
M.M.      13              12.5           yes 
S.McA.    12              14             no 
A.McL.    12.5            12             yes 
E.S.      12              11             yes 
P.W.      12.5            10             yes 

                                         u = 9 


As Table 14.2.3 indicates, the number of subjects showing improvement in 
their MCTs was u = 9 (as opposed to the $H_0$ expected value of 5.5). Let $\alpha = 0.05$. 
Since b = 11, the normal approximation is adequate, and $H_0$ should be rejected if 


    $$\frac{u - b/2}{\sqrt{b/4}} \geq z_\alpha = z_{0.05} = 1.64$$ 

But 

    $$\frac{u - b/2}{\sqrt{b/4}} = \frac{9 - 11/2}{\sqrt{11/4}} = 2.11$$ 


so the evidence here is fairly convincing that cyclandelate does speed up 
cerebral blood flow. 
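
For paired data the sign test needs only the count u of pairs with $x_i > y_i$. The Python sketch below recomputes the approximate Z ratio for the cyclandelate data and, as an additional check that is not part of the case study, the exact binomial upper-tail probability $P(U \geq u)$ when $p = \frac{1}{2}$. The variable names are illustrative.

```python
import math
from math import comb
from statistics import NormalDist

# Mean circulation times (sec) from Table 14.2.3.
before = [15, 12, 12, 14, 13, 13, 13, 12, 12.5, 12, 12.5]
after = [13, 8, 12.5, 12, 12, 12.5, 12.5, 14, 12, 11, 10]

b = len(before)
u = sum(1 for x, y in zip(before, after) if x > y)   # pairs showing a reduced MCT

# Large-sample version: approximate Z ratio and one-sided P-value.
z = (u - b / 2) / math.sqrt(b / 4)
approx_p = 1 - NormalDist().cdf(z)

# Exact version: U is binomial(b, 1/2) under H0.
exact_p = sum(comb(b, j) for j in range(u, b + 1)) / 2 ** b

print(f"u = {u}, z = {z:.2f}, approximate P = {approx_p:.4f}, exact P = {exact_p:.4f}")
```

Both P-values fall below 0.05, so the exact and approximate versions of the test lead to the same decision for these data.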


Questions 


14.2.1. Recall the data in Question 8.2.9 giving the sizes of 
10 gorilla groups studied in the Congo. Is it believable that 
the true median size, $\tilde{\mu}$, of all such groups is 9? Answer 
the question by finding the P-value associated with the 
null hypothesis $H_0\colon \tilde{\mu} = 9$. Assume that $H_1$ is two-sided. 
(Note: Tabulated below is the binomial pdf for the 
case where n = 10 and $p = \frac{1}{2}$.) 


MTB > pdf; 
SUBC > binomial 10 0.5. 


Probability Density Function 
Binomial with n = 10 and p = 0.5 

x    P(X = x) 
0    0.000977 
1    0.009766 
2    0.043945 
3    0.117188 
4    0.205078 
5    0.246094 
6    0.205078 
7    0.117188 
8    0.043945 
9    0.009766 
10   0.000977 
14.2.2. Test $H_0\colon \tilde{\mu} = 0.12$ versus $H_1\colon \tilde{\mu} < 0.12$ for the 
release chirp data given in Question 8.2.12. Compare the 
P-value associated with the large-sample test described 
in Theorem 14.2.1 with the exact P-value based on the 
binomial distribution. 


14.2.3. Below are n = 50 observations generated by 
Minitab’s RANDOM command that are presumably a 
random sample from the exponential pdf, $f_Y(y) = e^{-y}$, 
$y > 0$. Use Theorem 14.2.1 to test whether the difference 
between the sample median for these $y_i$’s (= 0.604) and 
the true median of $f_Y(y)$ is statistically significant. Let 
$\alpha = 0.05$. 


0.27187 0.46495 0.19368 0.80433 1.25450 0.62962 1.88300 
1.31951 2.53918 1.21187 0.95834 0.49017 0.87230 0.88571 
1.41717 1.75994 0.60280 2.19654 0.00594 4.11127 0.24130 
0.16473 0.08178 1.01424 0.60511 0.87973 0.06127 0.24758 
0.54407 0.05267 0.75210 0.13538 0.42956 0.02261 1.20378 
1.09271 1.88705 0.17500 0.50194 0.52122 0.02915 0.27348 
0.08916 0.72997 0.37185 0.06500 1.47721 4.02733 0.64003 
0.05603 


14.2.4. Let $Y_1, Y_2, \ldots, Y_{22}$ be a random sample of normally 
distributed random variables with an unknown mean $\mu$ 
and a known variance of 6.0. We wish to test 


        $H_0\colon \mu = 10$ 
            versus 
        $H_1\colon \mu > 10$ 


Construct a large-sample sign test having a Type I error 
probability of 0.05. What will the power of the test be if 
$\mu = 11$? 


14.2.5. Suppose that n = 7 paired observations, $(X_i, Y_i)$, 
are recorded, $i = 1, 2, \ldots, 7$. Let $p = P(Y_i > X_i)$. Write 
out the entire probability distribution for $Y_+$, the number 
of positive differences among the set of $Y_i - X_i$’s, 
$i = 1, 2, \ldots, 7$, assuming that $p = \frac{1}{2}$. What $\alpha$ levels are 
possible for testing $H_0\colon p = \frac{1}{2}$ versus $H_1\colon p > \frac{1}{2}$? 


14.2.6. Analyze the Shoshoni rectangle data (Case 
Study 7.4.2) with a sign test. Let $\alpha = 0.05$. 


14.2.7. Recall the FEV1/VC data described in 
Question 5.3.2. Test $H_0\colon \tilde{\mu} = 0.80$ versus $H_1\colon \tilde{\mu} < 0.80$ using a 
sign test. Compare this conclusion with that of a t test of 
$H_0\colon \mu = 0.80$ versus $H_1\colon \mu < 0.80$. Let $\alpha = 0.10$. Assume that 
$\sigma$ is unknown. 


14.2.8. Do a sign test on the ESP data in Question 13.2.1. 
Define $H_1$ to be one-sided, and let $\alpha = 0.05$. 


14.2.9. In a marketing research test, twenty-eight adult 
males were asked to shave one side of their face with one 
brand of razor blade and the other side with a second 
brand. They were to use the blades for seven days and then 
decide which was giving the smoother shave. Suppose that 
nineteen of the subjects preferred blade A. Use a sign test 
to determine whether it can be claimed, at the 0.05 level, 
that the difference in preferences is statistically significant. 


14.2.10. Suppose that a random sample of size 36, 
$Y_1, Y_2, \ldots, Y_{36}$, is drawn from a uniform pdf defined over 
the interval $(0, \theta)$, where $\theta$ is unknown. Set up a 
large-sample sign test for deciding whether or not the 25th 
percentile of the Y-distribution is equal to 6. Let $\alpha = 0.05$. 
With what probability will your procedure commit a 
Type I error if 7 is the true 25th percentile? 


14.2.11. Use a small-sample sign test to analyze the 
aerobics data given in Case Study 13.3.1. Use the binomial 
distribution displayed in Question 14.2.1. Let $\alpha = 0.05$. 
Does your conclusion agree with the inference drawn 
from the paired t test? 


14.3 Wilcoxon Tests 


Although the sign test is a bona fide nonparametric procedure, its extreme simplicity 
makes it somewhat atypical. The Wilcoxon signed rank test introduced in this section 
is more representative of nonparametric procedures as a whole. Like the sign test, 
it can be adapted to several different data structures. It can be used, for instance, as 
a one-sample test for location, where it becomes an alternative to the t test. It can 
also be applied to paired data, and with only minor modifications it can become a 
two-sample test for location and a two-sample test for dispersion (provided the two 
populations have equal locations). 


Testing $H_0\colon \mu = \mu_0$ 


Let $y_1, y_2, \ldots, y_n$ be a set of independent observations drawn from the pdfs 
$f_{Y_1}(y), f_{Y_2}(y), \ldots, f_{Y_n}(y)$, respectively, all of which are continuous and symmetric 
(but not necessarily the same). Let $\mu$ denote the (common) mean of the $f_{Y_i}(y)$’s. We 
wish to test 


        $H_0\colon \mu = \mu_0$ 
            versus 
        $H_1\colon \mu \neq \mu_0$ 

where $\mu_0$ is some prespecified value for $\mu$. 

For an application of this sort, the signed rank test is based on the magnitudes, 
and directions, of the deviations of the $y_i$’s from $\mu_0$. Let $|y_1 - \mu_0|, |y_2 - \mu_0|, \ldots, 
|y_n - \mu_0|$ be the set of absolute deviations of the $y_i$’s from $\mu_0$. These can be ordered from 
smallest to largest, and we can define $r_i$ to be the rank of $|y_i - \mu_0|$, where the smallest 
absolute deviation is assigned a rank of 1, the second smallest a rank of 2, and so on, 
up to n. If two or more observations are tied, each is assigned the average of the 
ranks they would have otherwise received. 

Associated with each $r_i$ will be a sign indicator, $z_i$, where 


    $$z_i = \begin{cases} 0 & \text{if } y_i - \mu_0 < 0 \\ 1 & \text{if } y_i - \mu_0 > 0 \end{cases}$$ 


The Wilcoxon signed rank statistic, w, is defined to be the linear combination 


    $$w = \sum_{i=1}^{n} r_i z_i$$ 


That is, w is the sum of the ranks associated with the positive deviations (from $\mu_0$). 
If $H_0$ is true, the sum of the ranks of the positive deviations should be roughly the 
same as the sum of the ranks of the negative deviations. 

To illustrate this terminology, consider the case where n = 3 and $y_1 = 6.0$, $y_2 = 4.9$, 
and $y_3 = 11.2$. Suppose the objective is to test 


        $H_0\colon \mu = 10.0$ 
            versus 
        $H_1\colon \mu \neq 10.0$ 


Note that $|y_1 - \mu_0| = 4.0$, $|y_2 - \mu_0| = 5.1$, and $|y_3 - \mu_0| = 1.2$. Since 1.2 < 4.0 < 5.1, it 
follows that $r_1 = 2$, $r_2 = 3$, and $r_3 = 1$. Also, $z_1 = 0$, $z_2 = 0$, and $z_3 = 1$. Combining the 
$r_i$’s and the $z_i$’s we have that 


    $$w = \sum_{i=1}^{n} r_i z_i = (0)(2) + (0)(3) + (1)(1) = 1$$ 

Comment Notice that w is based on the ranks of the deviations from $\mu_0$ and not on 
the deviations themselves. For this example, the value of w would remain unchanged 
if $y_2$ were 4.9, 3.6, or -10,000. In each case, $r_2$ would be 3 and $z_2$ would be 0. If the 
test statistic did depend on the magnitude of the deviations, it would have been 
necessary to specify a particular distribution for $f_Y(y)$, and the resulting procedure 
would no longer be nonparametric. 
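
The point of the Comment, that w depends on the data only through ranks and signs, is easy to confirm numerically. Here is a minimal Python sketch; the function name `signed_rank_statistic` is an illustrative choice, and the function assumes there are no tied absolute deviations and no observation exactly equal to $\mu_0$.

```python
def signed_rank_statistic(y, mu0):
    """Sum of the ranks of |y_i - mu0| that belong to positive deviations (no ties assumed)."""
    deviations = [v - mu0 for v in y]
    # Indices ordered from the smallest absolute deviation to the largest.
    order = sorted(range(len(y)), key=lambda i: abs(deviations[i]))
    w = 0
    for rank, i in enumerate(order, start=1):
        if deviations[i] > 0:      # z_i = 1 only for positive deviations
            w += rank
    return w


# The three-observation example, with y_2 replaced by increasingly extreme values.
w_values = [signed_rank_statistic([6.0, y2, 11.2], 10.0) for y2 in (4.9, 3.6, -10000)]
print(w_values)
```

In all three cases $y_2$ has the largest absolute deviation and a negative sign, so it contributes nothing to w no matter how extreme it becomes.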


Theorem 14.3.1 
Let $y_1, y_2, \ldots, y_n$ be a set of independent observations drawn, respectively, from the 
continuous and symmetric (but not necessarily identical) pdfs $f_{Y_i}(y)$, $i = 1, 2, \ldots, n$. 
Suppose that each of the $f_{Y_i}(y)$’s has the same mean $\mu$. If $H_0\colon \mu = \mu_0$ is true, the pdf of 
the data’s signed rank statistic, $p_W(w)$, is given by 

    $$p_W(w) = P(W = w) = \left(\frac{1}{2^n}\right) \cdot c(w)$$ 

where c(w) is the coefficient of $e^{wt}$ in the expansion of 

    $$\prod_{i=1}^{n} \left(1 + e^{it}\right)$$ 

Proof The statement and proof of Theorem 14.3.1 are typical of many nonparametric 
results. Closed-form expressions for sampling distributions are seldom possible: 
The combinatorial nature of nonparametric test statistics lends itself more readily 
to a generating function format. 

To begin, note that if $H_0$ is true, the distribution of the signed rank statistic is 
equivalent to the distribution of $U = \sum_{i=1}^{n} U_i$, where 

    $$U_i = \begin{cases} 0 & \text{with probability } \frac{1}{2} \\ i & \text{with probability } \frac{1}{2} \end{cases}$$ 

Therefore, W and U have the same moment-generating function. Since the data are 
presumed to be a random sample, the $U_i$’s are independent random variables, and 
from Theorem 3.12.3, 


    $$M_W(t) = M_U(t) = \prod_{i=1}^{n} M_{U_i}(t) = \prod_{i=1}^{n} \left(\frac{1}{2} + \frac{1}{2} e^{it}\right) = \left(\frac{1}{2^n}\right) \prod_{i=1}^{n} \left(1 + e^{it}\right) \qquad (14.3.1)$$ 

Now, consider the structure of $p_W(w)$, the pdf for the signed rank statistic. In 
the formation of w, $r_1$ can be prefixed by either a plus sign or zero; similarly for 
$r_2, r_3, \ldots$, and $r_n$. It follows that since each $r_i$ can take on two different values, the 
total number of ways to “construct” signed rank sums is $2^n$. Under $H_0$, of course, 
all of those scenarios are equally likely, so the pdf for the signed rank statistic must 
necessarily have the form 


    $$p_W(w) = P(W = w) = \frac{c(w)}{2^n} \qquad (14.3.2)$$ 


where c(w) is the number of ways to assign pluses and zeros to the first n integers so 
that $\sum_{i=1}^{n} r_i z_i$ has the value w. 

The conclusion of Theorem 14.3.1 follows immediately by comparing the form 
of $p_W(w)$ to Equation 14.3.1 and to the general expression for a moment-generating 
function. By definition, 


    $$M_W(t) = E\left(e^{tW}\right) = \sum_{w=0}^{n(n+1)/2} e^{wt}\, p_W(w)$$ 

but from Equations 14.3.1 and 14.3.2 we can write 


    $$\sum_{w=0}^{n(n+1)/2} e^{wt}\, p_W(w) = \left(\frac{1}{2^n}\right) \prod_{i=1}^{n} \left(1 + e^{it}\right) = \sum_{w=0}^{n(n+1)/2} e^{wt} \cdot \frac{c(w)}{2^n}$$ 


It follows that c(w) must be the coefficient of $e^{wt}$ in the expansion of $\prod_{i=1}^{n} \left(1 + e^{it}\right)$, and 
the theorem is proved. 


Calculating $p_W(w)$ 


A numerical example will help clarify the statement of Theorem 14.3.1. Suppose 
n = 4. By Equation 14.3.1, the moment-generating function for the signed rank 
statistic is the product 


    $$M_W(t) = \left(\frac{1 + e^{t}}{2}\right) \left(\frac{1 + e^{2t}}{2}\right) \left(\frac{1 + e^{3t}}{2}\right) \left(\frac{1 + e^{4t}}{2}\right)$$ 

    $$= \left(\frac{1}{16}\right) \left\{1 + e^{t} + e^{2t} + 2e^{3t} + 2e^{4t} + 2e^{5t} + 2e^{6t} + 2e^{7t} + e^{8t} + e^{9t} + e^{10t}\right\}$$ 


Thus, the probability that W equals, say, 2 is $\frac{1}{16}$ (since the coefficient of $e^{2t}$ is 1); the 
probability that W equals 7 is $\frac{2}{16}$; and so on. The first two columns of Table 14.3.1 
show the complete probability distribution of W, as given by the expansion of 
$M_W(t)$. The last column enumerates the particular assignments of pluses and zeros 
that generate each possible value w. 
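
Expanding the product in Theorem 14.3.1 is a polynomial multiplication, which makes it straightforward to automate. The Python sketch below treats $e^{t}$ as a formal variable x and multiplies out $\prod (1 + x^{i})$ one factor at a time; the function names `signed_rank_counts` and `signed_rank_pdf` are illustrative choices, not notation from the text.

```python
from fractions import Fraction


def signed_rank_counts(n):
    """c(w) for w = 0, ..., n(n+1)/2: the coefficients of prod_{i=1}^{n} (1 + x^i)."""
    coeffs = [1]                            # the empty product
    for i in range(1, n + 1):
        unshifted = coeffs + [0] * i        # current polynomial times 1
        shifted = [0] * i + coeffs          # current polynomial times x^i
        coeffs = [a + b for a, b in zip(unshifted, shifted)]
    return coeffs


def signed_rank_pdf(n):
    """Exact p_W(w) = c(w) / 2^n under H0, returned as fractions."""
    return [Fraction(c, 2 ** n) for c in signed_rank_counts(n)]


counts = signed_rank_counts(4)
pdf = signed_rank_pdf(4)
print(counts)
for w, p in enumerate(pdf):
    print(w, p)
```

Because the coefficients are kept as exact fractions, the probabilities can be compared directly with the entries of a table such as Table 14.3.1. Note that `Fraction` reduces to lowest terms, so 2/16 prints as 1/8.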


Tables of the cdf, $F_W(w)$ 


Cumulative tail area probabilities, 


    $$P(W \leq w_1^*) = \sum_{w=0}^{w_1^*} p_W(w) \quad \text{and} \quad P(W \geq w_2^*) = \sum_{w=w_2^*}^{n(n+1)/2} p_W(w)$$ 


are listed in Table A.6 of the Appendix for sample sizes ranging from n = 4 to n = 12. 
[Note: The smallest possible value for w is 0, and the largest possible value is the 
sum of the first n integers, n(n + 1)/2.] Based on these probabilities, decision rules 
for testing $H_0\colon \mu = \mu_0$ can be easily constructed. For example, suppose n = 7 and we 
wish to test 


        $H_0\colon \mu = \mu_0$ 
            versus 
        $H_1\colon \mu \neq \mu_0$ 


at the $\alpha = 0.05$ level of significance. The critical region would be the set of w values 
less than or equal to 2 or greater than or equal to 26, that is, $C = \{w\colon w \leq 2 \text{ or } w \geq 26\}$. 
That particular choice of C follows by inspection of Table A.6, because 


    $$\sum_{w \in C} p_W(w) = 0.023 + 0.023 \approx 0.05$$ 
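
Tail areas of the kind tabulated for the signed rank statistic can be recomputed from the exact null distribution. The following Python sketch is an illustration only; the helper names `signed_rank_pdf` and `tail_areas` are assumptions made here, and the cutoffs 2 and 26 are the ones quoted above for n = 7.

```python
def signed_rank_pdf(n):
    """Exact p_W(w) under H0 for w = 0, ..., n(n+1)/2, from the expansion of prod (1 + x^i)."""
    coeffs = [1]
    for i in range(1, n + 1):
        coeffs = [a + b for a, b in zip(coeffs + [0] * i, [0] * i + coeffs)]
    return [c / 2 ** n for c in coeffs]


def tail_areas(n, lower_cutoff, upper_cutoff):
    """Return (P(W <= lower_cutoff), P(W >= upper_cutoff)) under H0."""
    pdf = signed_rank_pdf(n)
    return sum(pdf[: lower_cutoff + 1]), sum(pdf[upper_cutoff:])


lower_tail, upper_tail = tail_areas(7, 2, 26)
size = lower_tail + upper_tail
print(f"P(W <= 2) = {lower_tail:.3f}, P(W >= 26) = {upper_tail:.3f}, size of C = {size:.3f}")

# Widening the region by one value on each side shows why 2 and 26 are the cutoffs used:
wider_lower, wider_upper = tail_areas(7, 3, 25)
print(f"size of the wider region = {wider_lower + wider_upper:.3f}")
```

Because W is discrete, only certain significance levels are attainable; the region with cutoffs 2 and 26 is the largest symmetric one whose size does not exceed 0.05.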
Table 14.3.1 Probability Distribution of W 

                                 $r_i$ 
 w    $p_W(w) = P(W = w)$     1   2   3   4 

 0          1/16              0   0   0   0 
 1          1/16              +   0   0   0 
 2          1/16              0   +   0   0 
 3          2/16              +   +   0   0 
                              0   0   +   0 
 4          2/16              +   0   +   0 
                              0   0   0   + 
 5          2/16              +   0   0   + 
                              0   +   +   0 
 6          2/16              +   +   +   0 
                              0   +   0   + 
 7          2/16              +   +   0   + 
                              0   0   +   + 
 8          1/16              +   0   +   + 
 9          1/16              0   +   +   + 
10          1/16              +   +   +   + 


Case Study 14.3.1 


Swell sharks (Cephaloscyllium ventriosum) are small, reef-dwelling sharks that 
inhabit the California coastal waters south of Monterey Bay. There is a second 
population of these fish living nearby in the vicinity of Catalina Island, but it 
has been hypothesized that the two populations never mix. In between Santa 
Catalina and the mainland is a deep basin, which, according to the “separation” 
hypothesis, is an impenetrable barrier for these particular fish (66). 

One way to test this theory would be to compare the morphology of sharks 
caught in the two regions. If there were no mixing, we would expect a certain 
number of differences to have evolved. Table 14.3.2 lists the total length (TL), 
the height of the first dorsal fin (HDI), and the ratio TL/HDI for ten male swell 
sharks caught near Santa Catalina. 

It has been estimated on the basis of past data that the true average TL/HDI 
ratio for male swell sharks caught off the coast is 14.60. Is that figure consistent 
with the data of Table 14.3.2? In more formal terms, if $\mu$ denotes the true mean 
TL/HDI ratio for the Santa Catalina population, can we reject $H_0\colon \mu = 14.60$, and 
thereby lend support to the separation theory? 

Table 14.3.3 gives the values of TL/HDI ($= y_i$), $y_i - 14.60$, $|y_i - 14.60|$, $r_i$, $z_i$, 
and $r_i z_i$ for the ten Santa Catalina sharks. Recall that when two or more 
numbers being ranked are equal, each is assigned the average of the ranks they 
would otherwise have received; here, $|y_6 - 14.60|$ and $|y_{10} - 14.60|$ are both 
competing for ranks 4 and 5, so each is assigned a rank of 4.5 [= (4 + 5)/2]. 


Table 14.3.2 Measurements Made on Ten Sharks Caught Near 
Santa Catalina 

Total Length (mm)   Height of First Dorsal Fin (mm)   TL/HDI 

906                 68                                13.32 
875                 67                                13.06 
771                 55                                14.02 
700                 59                                11.86 
869                 64                                13.58 
895                 65                                13.77 
662                 49                                13.51 
750                 52                                14.42 
794                 55                                14.44 
787                 51                                15.43 


Table 14.3.3 Computations for Wilcoxon Signed Rank Test 

TL/HDI ($= y_i$)   $y_i - 14.60$   $|y_i - 14.60|$   $r_i$   $z_i$   $r_i z_i$ 

13.32              -1.28           1.28              8       0       0 
13.06              -1.54           1.54              9       0       0 
14.02              -0.58           0.58              3       0       0 
11.86              -2.74           2.74              10      0       0 
13.58              -1.02           1.02              6       0       0 
13.77              -0.83           0.83              4.5     0       0 
13.51              -1.09           1.09              7       0       0 
14.42              -0.18           0.18              2       0       0 
14.44              -0.16           0.16              1       0       0 
15.43              +0.83           0.83              4.5     1       4.5 


Summing the last column of Table 14.3.3, we see that w = 4.5. According to 
Table A.6 in the Appendix, the $\alpha = 0.05$ decision rule for testing 


        $H_0\colon \mu = 14.60$ 
            versus 
        $H_1\colon \mu \neq 14.60$ 


requires that $H_0$ be rejected if w is either less than or equal to 8 or greater than 
or equal to 47. (Why is the alternative hypothesis two-sided here?) 

(Note: The exact level of significance associated with $C = \{w\colon w \leq 8 \text{ or } w \geq 47\}$ is 
0.024 + 0.024 = 0.048.) Thus we should reject $H_0$, since the observed w was less 
than 8. These particular data, then, would support the separation hypothesis. 
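
When absolute deviations are tied, as they are for two of the sharks, the ranking step needs the average-rank convention. The Python sketch below works through the case study under stated assumptions: the helper name `average_ranks` is illustrative, and the absolute deviations are rounded to six decimal places before comparison so that floating-point noise does not hide a genuine tie (0.83 arises from both 14.60 - 13.77 and 15.43 - 14.60).

```python
def average_ranks(values, ndigits=6):
    """Ranks 1..n of the values; tied values share the average of the ranks they occupy."""
    keyed = [round(v, ndigits) for v in values]      # guard against float noise in ties
    order = sorted(range(len(keyed)), key=lambda i: keyed[i])
    ranks = [0.0] * len(keyed)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and keyed[order[end + 1]] == keyed[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2           # average of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


ratios = [13.32, 13.06, 14.02, 11.86, 13.58, 13.77, 13.51, 14.42, 14.44, 15.43]
mu0 = 14.60

deviations = [y - mu0 for y in ratios]
ranks = average_ranks([abs(d) for d in deviations])
w = sum(r for r, d in zip(ranks, deviations) if d > 0)

reject = w <= 8 or w >= 47     # the alpha = 0.05 rule quoted from Table A.6 for n = 10
print(ranks)
print(f"w = {w}, reject H0: {reject}")
```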
About the Data If data came equipped with alarm bells, the measurements in 
Table 14.3.3 would be ringing up a storm. The cause for concern is the fact that the 
$y_i$’s being analyzed are the quotients of random variables (TL/HDI). A quotient can 
be difficult to interpret. If its value is unusually large, for example, does that imply 
that the numerator is unusually large or that the denominator is unusually small, or 
both? And what does an “average” value for a quotient imply? 

Also troublesome is the fact that distributions of quotients sometimes violate 
critical assumptions that we typically take for granted. Here, for example, both 
TL and HDI might conceivably be normally distributed. If they were independent 
standard normal random variables (the simplest possible case), their quotient 
Q = TL/HDI would have a Cauchy distribution with pdf 

    $$f_Q(q) = \frac{1}{\pi \left(1 + q^2\right)}, \quad -\infty < q < \infty$$ 

Although harmless looking, $f_Q(q)$ has some highly undesirable properties: neither 
its mean nor its variance is finite. Moreover, it does not obey the central limit 
theorem: the average of a random sample from a Cauchy distribution, 


    $$\bar{Q} = \frac{1}{n} \left(Q_1 + Q_2 + \cdots + Q_n\right)$$ 


has the same distribution as any single observation, $Q_i$ [see (92)]. Making matters 
worse, the data in Table 14.3.3 do not even represent the simplest case of a quotient 
of normal random variables; here the means and variances of both TL and HDI 
are unknown, and the two random variables may not be independent. 

For all these reasons, using a nonparametric procedure on these data is clearly 
indicated, and the Wilcoxon signed rank test is a good choice (because the assump- 
tions of continuity and symmetry are likely to be satisfied). The broader lesson, 
though, for experimenters to learn from this example is to think twice —maybe three 
times — before taking data in the form of quotients. 


Questions 

14.3.1. The average energy expenditures for eight elderly 
women were estimated on the basis of information 
received from a battery-powered heart rate monitor that 
each subject wore. Two overall averages were calculated 
for each woman, one for the summer months and one for 
the winter months (154), as shown in the following table. 
Let $\mu_D$ denote the location difference between the 
summer and winter energy expenditure populations. Compute 
$y_i - x_i$, $i = 1, 2, \ldots, 8$, and use the Wilcoxon signed rank 
procedure to test 


        $H_0\colon \mu_D = 0$ 
            versus 
        $H_1\colon \mu_D \neq 0$ 


Let $\alpha = 0.15$. 


Average Daily Energy Expenditures (kcal) 

Subject   Summer, $x_i$   Winter, $y_i$ 

1         1458            1424 
2         1353            1501 
3         2209            1495 
4         1804            1739 
5         1912            2031 
6         1366            934 
7         1598            1401 
8         1406            1339 


14.3.2. Use the expansion of 


    $$\prod_{i=1}^{n} \left(1 + e^{it}\right)$$ 


to find the pdf of W when n = 5. What $\alpha$ levels are available 
for testing $H_0\colon \mu = \mu_0$ versus $H_1\colon \mu > \mu_0$? 
A Large-Sample Wilcoxon Signed Rank Test 


The usefulness of Table A.6 in the Appendix for testing $H_0\colon \mu = \mu_0$ is limited 
to sample sizes less than or equal to 12. For larger n, an approximate signed 
rank test can be constructed, using E(W) and Var(W) to define an approximate 
Z ratio. 


Theorem 14.3.2 
When $H_0\colon \mu = \mu_0$ is true, the mean and variance of the Wilcoxon signed rank statistic, 
W, are given by 


    $$E(W) = \frac{n(n+1)}{4}$$ 


and 


    $$\text{Var}(W) = \frac{n(n+1)(2n+1)}{24}$$ 


Also, for n > 12, the distribution of 


    $$\frac{W - [n(n+1)]/4}{\sqrt{[n(n+1)(2n+1)]/24}}$$ 


can be adequately approximated by the standard normal pdf, $f_Z(z)$. 


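Proof We will derive E(W) and Var(W); for a proof of the asymptotic normality, see 
(80). Recall that W has the same distribution as $U = \sum_{i=1}^{n} U_i$, where 


    $$U_i = \begin{cases} 0 & \text{with probability } \frac{1}{2} \\ i & \text{with probability } \frac{1}{2} \end{cases}$$ 


Therefore, 

    $$E(W) = E\left(\sum_{i=1}^{n} U_i\right) = \sum_{i=1}^{n} E(U_i) = \sum_{i=1}^{n} \left(0 \cdot \frac{1}{2} + i \cdot \frac{1}{2}\right) = \sum_{i=1}^{n} \frac{i}{2} = \frac{n(n+1)}{4}$$ 

Similarly, 


    $$\text{Var}(W) = \text{Var}(U) = \sum_{i=1}^{n} \text{Var}(U_i)$$ 


since the $U_i$’s are independent. But 


    $$\text{Var}(U_i) = E\left(U_i^2\right) - [E(U_i)]^2 = \frac{i^2}{2} - \left(\frac{i}{2}\right)^2 = \frac{i^2}{4}$$ 


making 

    $$\text{Var}(W) = \sum_{i=1}^{n} \frac{i^2}{4} = \left(\frac{1}{4}\right) \left[\frac{n(n+1)(2n+1)}{6}\right] = \frac{n(n+1)(2n+1)}{24}$$ 
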
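
The two formulas just derived can be confirmed by brute force for small n. The Python sketch below enumerates all $2^n$ equally likely assignments of pluses and zeros to the ranks 1 through n and computes the mean and variance of the resulting sums exactly; the function name `exact_moments` and the range of n checked are choices made for this illustration.

```python
from fractions import Fraction
from itertools import product


def exact_moments(n):
    """Mean and variance of W under H0, found by enumerating all 2^n sign assignments."""
    total = 0
    total_sq = 0
    for signs in product((0, 1), repeat=n):      # each z_i is 0 or 1 with probability 1/2
        w = sum(rank * z for rank, z in zip(range(1, n + 1), signs))
        total += w
        total_sq += w * w
    mean = Fraction(total, 2 ** n)
    variance = Fraction(total_sq, 2 ** n) - mean ** 2
    return mean, variance


agreement = []
for n in range(1, 11):
    mean, variance = exact_moments(n)
    agreement.append(
        mean == Fraction(n * (n + 1), 4)
        and variance == Fraction(n * (n + 1) * (2 * n + 1), 24)
    )

print(all(agreement))
```

Exact rational arithmetic is used so that the comparison with $n(n+1)/4$ and $n(n+1)(2n+1)/24$ is an equality check rather than a tolerance check.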
Theorem 14.3.3 Let w be the signed rank statistic based on n independent observations,
each drawn from a continuous and symmetric pdf, where n > 12. Let

\[ z = \frac{w - [n(n+1)]/4}{\sqrt{[n(n+1)(2n+1)]/24}} \]

a. To test $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$ at the $\alpha$ level of significance, reject $H_0$ if
   $z \geq z_\alpha$.

b. To test $H_0: \mu = \mu_0$ versus $H_1: \mu < \mu_0$ at the $\alpha$ level of significance, reject $H_0$ if
   $z \leq -z_\alpha$.

c. To test $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$ at the $\alpha$ level of significance, reject $H_0$ if z is
   either (1) $\leq -z_{\alpha/2}$ or (2) $\geq z_{\alpha/2}$.
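
The decision rules in Theorem 14.3.3 translate directly into a few lines of code. The following Python sketch is one possible rendering, not a definitive implementation: the function names, the `alternative` keyword, and the illustrative inputs (w = 95.0, n = 14) are assumptions made here. Critical values are taken from the standard library's `statistics.NormalDist` rather than from a printed normal table.

```python
import math
from statistics import NormalDist


def signed_rank_z(w, n):
    """Approximate Z ratio for the signed rank statistic w based on n observations."""
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    return (w - mean) / math.sqrt(var)


def signed_rank_test(w, n, alpha=0.05, alternative="greater"):
    """Return (z, reject) using parts (a), (b), or (c) of Theorem 14.3.3."""
    z = signed_rank_z(w, n)
    if alternative == "greater":        # part (a): H1: mu > mu0
        reject = z >= NormalDist().inv_cdf(1 - alpha)
    elif alternative == "less":         # part (b): H1: mu < mu0
        reject = z <= -NormalDist().inv_cdf(1 - alpha)
    elif alternative == "two-sided":    # part (c): H1: mu != mu0
        reject = abs(z) >= NormalDist().inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, reject


# Illustrative values only: w = 95.0 computed from n = 14 observations
z, reject = signed_rank_test(95.0, 14, alpha=0.05, alternative="greater")
print(round(z, 2), reject)
```

The theorem's requirement that n exceed 12 is not enforced by the sketch; for smaller samples the exact critical values should be used instead.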


Case Study 14.3.2 


Cyclazocine and methadone are two of the drugs widely used in the
treatment of heroin addiction. Some years ago, a study was done (141) to
evaluate the effectiveness of the former in reducing a person’s psychological
dependence on heroin. The subjects were fourteen males, all chronic addicts.
Each was asked a battery of questions that compared his feelings when he
was using heroin to his feelings when he was “clean.” The resultant Q-scores
ranged from a possible minimum of 11 to a possible maximum of 55, as
shown in Table 14.3.4. (From the way the questions were worded, higher scores
represented less psychological dependence.)

The shape of the histogram for these data suggests that a normality assump- 
tion may not be warranted—the weaker assumption of symmetry is more 
believable. That said, a case can be made for using a signed rank test on these
data, rather than a one-sample t test.

Table 14.3.4 Q-Scores of Heroin Addicts after Cyclazocine Therapy
51 43 
53 45 
43 27 
36 21 
55 26 
55 22 
39 43 


The mean score for addicts not given cyclazocine is known from past expe- 
rience to be 28. Can we conclude on the basis of the data in Table 14.3.4 that 
cyclazocine is an effective treatment? 

Since high Q-scores represent less dependence on heroin (and assuming
cyclazocine would not tend to worsen an addict’s condition), the alternative
hypothesis should be one-sided to the right. That is, we want to test

\[ H_0: \mu = 28 \]
versus
\[ H_1: \mu > 28 \]

Let $\alpha$ be 0.05.

Table 14.3.5 details the computations showing that the signed rank statistic,
w (that is, the sum of the $r_i z_i$ column), equals 95.0. Since n = 14, E(W) =
[14(14 + 1)]/4 = 52.5 and Var(W) = [14(14 + 1)(28 + 1)]/24 = 253.75, so the
approximate Z ratio is

\[ z = \frac{95.0 - 52.5}{\sqrt{253.75}} = 2.67 \]

Table 14.3.5 Computations to Find w

Q-Score, $y_i$   $(y_i - 28)$   $|y_i - 28|$   $r_i$    $z_i$   $r_i z_i$
     51             +23             23         11       1       11
     53             +25             25         12       1       12
     43             +15             15          8       1        8
     36              +8              8          5       1        5
     55             +27             27         13.5     1       13.5
     55             +27             27         13.5     1       13.5
     39             +11             11          6       1        6
     43             +15             15          8       1        8
     45             +17             17         10       1       10
     27              -1              1          1       0        0
     21              -7              7          4       0        0
     26              -2              2          2       0        0
     22              -6              6          3       0        0
     43             +15             15          8       1        8
                                                                95.0

The Z ratio of 2.67 considerably exceeds the one-sided 0.05 critical value identified in
part (a) of Theorem 14.3.3 ($= z_{0.05} = 1.64$), so the appropriate conclusion is to
reject $H_0$; it would appear that cyclazocine therapy is helpful in reducing heroin
dependence.
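
The ranking step summarized in Table 14.3.5, including the averaging of tied ranks, can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the helper name `average_ranks` is hypothetical, and the Q-scores are typed in the order in which they appear in Table 14.3.5.

```python
def average_ranks(values):
    """Rank values from smallest (1) to largest, giving ties their average rank.

    For each value x: (number of values below x) + (number equal to x + 1) / 2.
    """
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


# Q-scores in the order listed in Table 14.3.5
q_scores = [51, 53, 43, 36, 55, 55, 39, 43, 45, 27, 21, 26, 22, 43]
mu0 = 28  # hypothesized location under H0

diffs = [y - mu0 for y in q_scores]
ranks = average_ranks([abs(d) for d in diffs])   # r_i
z_ind = [1 if d > 0 else 0 for d in diffs]       # z_i
w = sum(r * z for r, z in zip(ranks, z_ind))     # signed rank statistic

print(ranks)
print(w)
```

The three differences equal to 15 share ranks 7, 8, and 9, so each receives 8; the two differences equal to 27 share ranks 13 and 14, so each receives 13.5.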


Testing $H_0: \mu_D = 0$ (Paired Data)


A Wilcoxon signed rank test can also be used on paired data to test $H_0: \mu_D = 0$,
where $\mu_D = \mu_X - \mu_Y$ (recall Section 13.3). Suppose that responses to two treatment
levels (X and Y) are recorded within each of n pairs. Let $d_i = x_i - y_i$ be the response
difference recorded for Treatment X and Treatment Y within the ith pair, and let $r_i$
be the rank of $|x_i - y_i|$ in the set $|x_1 - y_1|, |x_2 - y_2|, \ldots, |x_n - y_n|$. Define

\[ z_i = \begin{cases} 1 & \text{if } x_i - y_i > 0 \\ 0 & \text{if } x_i - y_i < 0 \end{cases} \]

and let $w = \sum_{i=1}^{n} r_i z_i$.

If $n \leq 12$, critical values for testing $H_0: \mu_D = 0$ are obtained from Table A.6 in the
Appendix in exactly the same way that decision rules were determined for using the
signed rank test on $H_0: \mu = \mu_0$. If n > 12, an approximate Z test for $H_0: \mu_D = 0$ can
be carried out using the formulas given in Theorem 14.3.2.


Case Study 14.3.3 


Until recently, all evaluations of college courses and instructors have been done 
in class using questionnaires that were filled out in pencil. But as administrators 
well know, tabulating those results and typing the students’ written comments 
(to preserve anonymity) take up a considerable amount of secretarial time. To 
expedite that process, some schools have considered doing evaluations online. 
Not all faculty support such a change, though, because of their suspicion that 
online evaluations might result in lower ratings (which, in turn, would affect 
their chances for reappointment, tenure, or promotion). 

To investigate the merits of that concern, one university (104) did a pilot 
study where a small number of instructors had their courses evaluated online. 
Those same teachers had taught the same course the previous year and had been 
evaluated the usual way in class. Table 14.3.6 shows a portion of the results. The 
numbers listed are the responses on a 1- to 5-point scale (“5” being the best) to 
the question “Overall Rating of the Instructor.” Here, x; and y; denote the ith 
instructor’s ratings “in-class” and “online,” respectively. 

To test $H_0: \mu_D = 0$ versus $H_1: \mu_D \neq 0$, where $\mu_D = \mu_X - \mu_Y$, at the $\alpha = 0.05$
level of significance requires that $H_0$ be rejected if the approximate Z ratio in
Theorem 14.3.2 is either (1) $\leq -1.96$ or (2) $\geq +1.96$. But

\[ \frac{w - [n(n+1)/4]}{\sqrt{[n(n+1)(2n+1)]/24}} = \frac{70 - [15(16)/4]}{\sqrt{[15(16)(31)]/24}} = 0.57 \]

so the appropriate conclusion is to “fail to reject $H_0$.” The results in Table 14.3.6
are entirely consistent, in other words, with the hypothesis that the mode of
evaluation (in-class or online) has no bearing on an instructor’s rating.


Table 14.3.6

Obs. #   Instr.   In-Class, $x_i$   Online, $y_i$   $|x_i - y_i|$   $r_i$   $z_i$   $r_i z_i$
  1       EF          4.67             4.36            0.31           7       1        7
  2       LC          3.50             3.64            0.14           3       0        0
  3       AM          3.50             4.00            0.50          11       0        0
  4       CH          3.88             3.26            0.62          12       1       12
  5       DW          3.94             4.06            0.12           2       0        0
  6       CA          4.88             4.58            0.30           6       1        6
  7       MP          4.00             3.52            0.48          10       1       10
  8       CP          4.40             3.66            0.74          13       1       13
  9       RR          4.41             4.43            0.02           1       0        0
 10       TB          4.11             4.28            0.17           4       0        0
 11       GS          3.45             4.25            0.80          15       0        0
 12       HT          4.29             4.00            0.29           5       1        5
 13       DW          4.25             5.00            0.75          14       0        0
 14       FE          4.18             3.85            0.33           8       1        8
 15       WD          4.65             4.18            0.47           9       1        9

                                                                               w = 70
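
For readers who want to retrace Table 14.3.6 by machine, here is a hedged Python sketch of the paired-data calculation. The variable and function names are illustrative assumptions; the ratings are copied from the table, and each difference is rounded to two decimals so that floating-point noise cannot disturb the ranking.

```python
import math


def average_ranks(values):
    """Ranks from smallest (1) to largest; tied values share their average rank."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


# Ratings copied from Table 14.3.6 (x_i = in-class, y_i = online)
in_class = [4.67, 3.50, 3.50, 3.88, 3.94, 4.88, 4.00, 4.40,
            4.41, 4.11, 3.45, 4.29, 4.25, 4.18, 4.65]
online = [4.36, 3.64, 4.00, 3.26, 4.06, 4.58, 3.52, 3.66,
          4.43, 4.28, 4.25, 4.00, 5.00, 3.85, 4.18]

# Round to two decimals to remove floating-point noise in the differences
diffs = [round(x - y, 2) for x, y in zip(in_class, online)]
ranks = average_ranks([abs(d) for d in diffs])
w = sum(r for r, d in zip(ranks, diffs) if d > 0)

n = len(diffs)
z = (w - n * (n + 1) / 4) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
print(w, round(z, 2))
```

Since |z| falls well inside the interval (-1.96, +1.96), the sketch leads to the same "fail to reject" conclusion as the hand calculation.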


About the Data Theoretically, the fact that all of the in-class evaluations were 
done first poses some problems for the interpretation of the ratings in Table 14.3.6. If 
instructors tend to receive higher (or lower) ratings on successive attempts to teach 
the same course, then the differences x; — y; would be biased by a time effect. How- 
ever, when instructors have already taught a course several times (which was true 
for the faculty included in Table 14.3.6), experience has shown that trends in future 
attempts are not what tend to happen—instead, ratings go up and down, seemingly 
at random. 


Testing $H_0: \mu_X = \mu_Y$ (The Wilcoxon Rank Sum Test)


Another redefinition of the statistic $w = \sum_i r_i z_i$ allows ranks to be used as a way of
testing the two-sample hypothesis, $H_0: \mu_X = \mu_Y$, where $\mu_X$ and $\mu_Y$ are the means
of two continuous distributions, $f_X(x)$ and $f_Y(y)$. It will be assumed that $f_X(x)$ and
$f_Y(y)$ have the same shape and the same standard deviation, but they may differ with
respect to location; that is, $Y = X - c$, for some constant c. When those restrictions
are met, the Wilcoxon rank sum test can appropriately be used as a nonparametric
alternative to the pooled two-sample t test.

Let $x_1, x_2, \ldots, x_n$ and $y_{n+1}, y_{n+2}, \ldots, y_{n+m}$ be two independent random samples
of sizes n and m from $f_X(x)$ and $f_Y(y)$, respectively. Define $r_i$ to be the rank of
the ith observation in the combined sample (so $r_i$ ranges from 1 for the smallest
observation to n + m for the largest observation).

Let

\[ z_i = \begin{cases} 1 & \text{if the ith observation came from } f_X(x) \\ 0 & \text{if the ith observation came from } f_Y(y) \end{cases} \]

and define

\[ w' = \sum_{i=1}^{n+m} r_i z_i \]

Here, $w'$ denotes the sum of the ranks in the combined sample of the n observations
coming from $f_X(x)$. Clearly, $w'$ is capable of distinguishing between $H_0$ and $H_1$. If,
for example, $f_X(x)$ has shifted to the right of $f_Y(y)$, the sum of the ranks of the x
observations would tend to be larger than if $f_X(x)$ and $f_Y(y)$ had the same location.
For small values of n and m, critical values for $w'$ have been tabulated [see, for
example, (81)]. When n and m both exceed 10, a normal approximation can be used.


Theorem 14.3.4 Let $x_1, x_2, \ldots, x_n$ and $y_{n+1}, y_{n+2}, \ldots, y_{n+m}$ be two independent random
samples from $f_X(x)$ and $f_Y(y)$, respectively, where the two pdfs are the same except for a
possible shift in location. Let $r_i$ denote the rank of the ith observation in the combined
sample (where the smallest observation is assigned a rank of 1 and the largest observation,
a rank of n + m). Let

\[ w' = \sum_{i=1}^{n+m} r_i z_i \]

where $z_i$ is 1 if the ith observation comes from $f_X(x)$ and 0, otherwise. Then

\[ E(W') = \frac{n(n+m+1)}{2} \]

\[ \mathrm{Var}(W') = \frac{nm(n+m+1)}{12} \]

and

\[ \frac{W' - n(n+m+1)/2}{\sqrt{nm(n+m+1)/12}} \]

has approximately a standard normal pdf if n > 10 and m > 10.


Proof See (102). 


Case Study 14.3.4 


In Major League Baseball, American League teams have the option of using a 
“designated hitter” to bat for a particular position player, typically the pitcher. 
In the National League, no such substitutions are allowed, and every player 
must bat for himself (or be removed from the game). As a result, batting and 
base-running strategies employed by National League managers are much different
from those used by their American League counterparts. What is not
so obvious is whether those differences in how games are played have any
demonstrable effect on how long it takes games to be played.

Table 14.3.7 shows the average home-game completion time (in minutes)
reported by the twenty-six Major League teams for the 1992 season. The American
League average was 173.5 minutes; the National League average, 165.8
minutes. Is the difference between those two averages statistically significant?

The entry at the bottom of the last column is the sum of the ranks of
the American League times, that is, $w' = \sum_{i=1}^{26} r_i z_i = 240.5$. Since the American
League and National League had n = 14 and m = 12 teams, respectively, in 1992,
the formulas in Theorem 14.3.4 give

\[ E(W') = \frac{14(14 + 12 + 1)}{2} = 189 \]

and

\[ \mathrm{Var}(W') = \frac{14 \cdot 12(14 + 12 + 1)}{12} = 378 \]
Table 14.3.7

Obs. #   Team             Time (min)   $r_i$    $z_i$   $r_i z_i$
  1      Baltimore           177       21       1       21
  2      Boston              177       21       1       21
  3      California          165        7.5     1        7.5
  4      Chicago (AL)        172       14.5     1       14.5
  5      Cleveland           172       14.5     1       14.5
  6      Detroit             179       24.5     1       24.5
  7      Kansas City         163        5       1        5
  8      Milwaukee           175       18       1       18
  9      Minnesota           166        9.5     1        9.5
 10      New York (AL)       182       26       1       26
 11      Oakland             177       21       1       21
 12      Seattle             168       12.5     1       12.5
 13      Texas               179       24.5     1       24.5
 14      Toronto             177       21       1       21
 15      Atlanta             166        9.5     0        0
 16      Chicago (NL)        154        1       0        0
 17      Cincinnati          159        2       0        0
 18      Houston             168       12.5     0        0
 19      Los Angeles         174       16.5     0        0
 20      Montreal            174       16.5     0        0
 21      New York (NL)       177       21       0        0
 22      Philadelphia        167       11       0        0
 23      Pittsburgh          165        7.5     0        0
 24      San Diego           161        3.5     0        0
 25      San Francisco       164        6       0        0
 26      St. Louis           161        3.5     0        0
                                                     $w'$ = 240.5

The approximate Z statistic, then, is

\[ z = \frac{w' - E(W')}{\sqrt{\mathrm{Var}(W')}} = \frac{240.5 - 189}{\sqrt{378}} = 2.65 \]

At the $\alpha = 0.05$ level, the critical values for testing $H_0: \mu_X = \mu_Y$ versus
$H_1: \mu_X \neq \mu_Y$ would be $\pm 1.96$. The conclusion, then, is to reject $H_0$; the difference
between 173.5 and 165.8 is statistically significant.

(Note: When two or more observations are tied, they are each assigned the
average of the ranks they would have received had they been slightly different.
There were five observations that equaled 177, and they were competing for
the ranks 19, 20, 21, 22, and 23. Each, then, received the corresponding average
value of 21.)

Questions 


14.3.3. Two manufacturing processes are available for 
annealing a certain kind of copper tubing, the primary 
difference being in the temperature required. The criti- 
cal response variable is the resulting tensile strength. To 
compare the methods, fifteen pieces of tubing were bro- 
ken into pairs. One piece from each pair was randomly 
selected to be annealed at a moderate temperature, the 
other piece at a high temperature. The resulting tensile 
strengths (in tons/sq in.) are listed in the following table. 
Analyze these data with a Wilcoxon signed rank test. Use
a two-sided alternative. Let $\alpha = 0.05$.


Tensile Strengths (tons/sq in.) 


Moderate High 
Pair Temperature Temperature 

1 16.5 16.9 
2 17.6 17.2 
3 16.9 17.0 
4 15.8 16.1 
5 18.4 18.2
6 17.5 17.7 
7 17.6 17.9 
8 16.1 16.0 
9 16.8 17.3 
10 15.8 16.1 
11 16.8 16.5 
12 17.3 17.6 
13 18.1 18.4 
14 17.9 17.2 
15 16.4 16.5 


14.3.4. To measure the effect on coordination associated 
with mild intoxication, thirteen subjects were each given 
15.7mL of ethyl alcohol per square meter of body sur- 
face area and asked to write a certain phrase as many 
times as they could in the space of one minute (119). 
The number of correctly written letters was then counted 
and scaled, with a scale value of 0 representing the score 
a subject not under the influence of alcohol would be 
expected to achieve. Negative scores indicate decreased 
writing speeds; positive scores, increased writing speeds. 


Subject Score Subject Score

1 -6 8 0
2 10 9 -7
3 9 10 5
4 -8 11 -9
5 -6 12 -10
6 -2 13 -2
7 20


Use the signed rank test to determine whether the level
of alcohol provided in this study had any effect on writing
speed. Let $\alpha = 0.05$. Omit Subject 8 from your calculations.


14.3.5. Test $H_0: \mu = 0.80$ versus $H_1: \mu < 0.80$ for the
FEV$_1$/VC ratio data of Question 5.3.2 using a Wilcoxon
signed rank test. Let $\alpha = 0.10$. Compare this test to the
sign test of Question 14.2.7.


14.3.6. Do a Wilcoxon signed rank test on the
hemoglobin data summarized in Case Study 13.3.1. Let $\alpha$
be 0.05. Compare your conclusion with the outcome of the
sign test done in Question 14.2.11.


14.3.7. Suppose that the population being sampled is
symmetric and we wish to test $H_0: \mu = \mu_0$. Both the sign
test and the signed rank test would be valid. Which
procedure, if either, would you expect to have greater
power? Why?


14.3.8. Use a signed rank test to analyze the depth
perception data given in Question 8.2.6. Let $\alpha = 0.05$.


14.3.9. Recall Question 9.2.6. Compare the ages at death
for authors noted for alcohol abuse with the ages at death
for authors not noted for alcohol abuse using a Wilcoxon
rank sum test. Let $\alpha = 0.05$.


14.3.10. Use a large-sample Wilcoxon rank sum test to
analyze the alpha wave data summarized in Table 9.3.1.
Let $\alpha = 0.05$.


14.4 The Kruskal-Wallis Test 


The next two sections of this chapter discuss the nonparametric counterparts for the 
two analysis of variance models introduced in Chapters 12 and 13. Neither of these 
procedures, the Kruskal-Wallis test and the Friedman test, will be derived. We will 
simply state the procedures and illustrate them with examples. 

First, we consider the k-sample problem. Suppose that $k\ (\geq 2)$ independent random
samples of sizes $n_1, n_2, \ldots, n_k$ are drawn, representing k continuous populations
having the same shape but possibly different locations:
$f_{Y_1}(y - c_1) = f_{Y_2}(y - c_2) = \cdots = f_{Y_k}(y - c_k)$, for constants $c_1, c_2, \ldots, c_k$.
The objective is to test whether the locations of the $f_{Y_j}(y)$’s, $j = 1, 2, \ldots, k$,
might all be the same, that is,

\[ H_0: \mu_1 = \mu_2 = \cdots = \mu_k \]
versus
\[ H_1: \text{not all the } \mu_j\text{’s are equal} \]

The Kruskal-Wallis procedure for testing $H_0$ is really quite simple, involving
considerably fewer computations than the analysis of variance. The first step is to
rank the entire set of $n = \sum_{j=1}^{k} n_j$ observations from smallest to largest. Then the rank
sum, $R_{.j}$, is calculated for each sample. Table 14.4.1 shows the notation that will be
used: It follows the same conventions as the dot notation of Chapter 12. The only
difference is the addition of $R_{ij}$, the symbol for the rank corresponding to $Y_{ij}$.
The Kruskal-Wallis statistic, B, is defined as

\[ B = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{R_{.j}^2}{n_j} - 3(n+1) \]

Table 14.4.1 Notation for Kruskal-Wallis Procedure

                              Treatment Level
               1                        2               ...            k
         $Y_{11}$ ($R_{11}$)        $Y_{12}$ ($R_{12}$)       ...     $Y_{1k}$ ($R_{1k}$)
         $Y_{21}$ ($R_{21}$)              ...
               ...
         $Y_{n_1 1}$ ($R_{n_1 1}$)  $Y_{n_2 2}$ ($R_{n_2 2}$) ...     $Y_{n_k k}$ ($R_{n_k k}$)

Totals       $R_{.1}$                 $R_{.2}$            ...          $R_{.k}$


Notice how B resembles the computing formula for SSTR in the analysis of variance.
Here $\sum_{j=1}^{k} (R_{.j}^2/n_j)$, and thus B, get larger and larger as the differences between the
population locations increase. [Recall that a similar explanation was given for SSTR
and $\sum_{j=1}^{k} (T_{.j}^2/n_j)$.]


Theorem 14.4.1 Suppose $n_1, n_2, \ldots, n_k$ independent observations are taken from the pdfs
$f_{Y_1}(y), f_{Y_2}(y), \ldots, f_{Y_k}(y)$, respectively, where the $f_{Y_i}(y)$’s are all continuous and have
the same shape. Let $\mu_i$ be the mean of $f_{Y_i}(y)$, $i = 1, 2, \ldots, k$, and let $R_{.1}, R_{.2}, \ldots, R_{.k}$
denote the rank sums associated with each of the k samples. If $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ is true,

\[ B = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{R_{.j}^2}{n_j} - 3(n+1) \]

has approximately a $\chi^2_{k-1}$ distribution and $H_0$ should be rejected at the $\alpha$ level of
significance if $b > \chi^2_{1-\alpha,k-1}$.
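
To make the arithmetic behind B concrete, the Python sketch below ranks a pooled set of observations, forms the k rank sums, and evaluates the statistic. It is a toy illustration rather than a definitive implementation: the function names and the three tiny samples are invented for this example, and with samples this small the chi square approximation would not be trusted in practice.

```python
def average_ranks(values):
    """Ranks from smallest (1) to largest; tied values share their average rank."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


def kruskal_wallis(samples):
    """Return (rank_sums, b) for a list of k samples."""
    pooled = [y for sample in samples for y in sample]
    ranks = average_ranks(pooled)
    n = len(pooled)

    rank_sums, start = [], 0
    for sample in samples:                     # split pooled ranks back by sample
        stop = start + len(sample)
        rank_sums.append(sum(ranks[start:stop]))
        start = stop

    total = sum(r ** 2 / len(s) for r, s in zip(rank_sums, samples))
    b = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    return rank_sums, b


# Hypothetical data, chosen only so the ranks are easy to follow by hand
samples = [[1.2, 2.5, 3.1], [4.0, 5.3, 6.8], [7.7, 8.1, 9.4]]
rank_sums, b = kruskal_wallis(samples)
print(rank_sums, round(b, 2))
```

Here the rank sums are 6, 15, and 24, so $b = \frac{12}{9(10)}\left(\frac{36}{3} + \frac{225}{3} + \frac{576}{3}\right) - 3(10) = 7.2$, which would be compared with the appropriate $\chi^2_{k-1}$ cutoff taken from a chi square table.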


Case Study 14.4.1 


On December 1, 1969, a lottery was held in Selective Service headquarters in 
Washington, D.C., to determine the draft status of all nineteen-year-old males. 
It was the first time such a procedure had been used since World War II. Priori- 
ties were established according to a person’s birthday. Each of the 366 possible 
birth dates was written on a slip of paper and put into a small capsule. The 
capsules were then put into a large bowl, mixed, and drawn out one by one. 
By agreement, persons whose birthday corresponded to the first capsule drawn 
would have the highest draft priority; those whose birthday corresponded to 
the second capsule drawn, the second highest priority, and so on. Table 14.4.2 
shows the order in which the 366 birthdates were drawn (160). The first date 
was September 14 (=001); the last, June 8 (= 366). 

We can think of the observed sequence of draft priorities as ranks from 1 to 
366. If the lottery was random, the distributions of those ranks for each of the 
months should have been approximately equal. If the lottery was not random, 
we would expect to see certain months having a preponderance of high ranks 
and other months, a preponderance of low ranks. 

Look at the rank totals at the bottom of Table 14.4.2. The differences from 
month to month are surprisingly large, ranging from a high of 7000 for March 
to a low of 3768 for December. Even more unexpected is the pattern in the vari- 
ation (see Figure 14.4.1). Are the rank totals listed in Table 14.4.2 and the rank 
averages pictured in Figure 14.4.1 consistent with the hypothesis that the lottery 
was random? 

Substituting the $R_{.j}$’s into the formula for B gives

\[ b = \frac{12}{366(367)}\left[\frac{(6236)^2}{31} + \cdots + \frac{(3768)^2}{31}\right] - 3(367) = 25.95 \]

By Theorem 14.4.1, B has approximately a chi square distribution with 11
degrees of freedom (when $H_0: \mu_{Jan} = \mu_{Feb} = \cdots = \mu_{Dec}$ is true).


Table 14.4.2 1969 Draft Lottery, Highest Priority (001) to Lowest Priority (366)

Date    Jan.  Feb.  Mar.  Apr.  May   June  July  Aug.  Sept. Oct.  Nov.  Dec.
  1     305   086   108   032   330   249   093   111   225   359   019   129
  2     159   144   029   271   298   228   350   045   161   125   034   328
  3     251   297   267   083   040   301   115   261   049   244   348   157
  4     215   210   275   081   276   020   279   145   232   202   266   165
  5     101   214   293   269   364   028   188   054   082   024   310   056
  6     224   347   139   253   155   110   327   114   006   087   076   010
  7     306   091   122   147   035   085   050   168   008   234   051   012
  8     199   181   213   312   321   366   013   048   184   283   097   105
  9     194   338   317   219   197   335   277   106   263   342   080   043
 10     325   216   323   218   065   206   284   021   071   220   282   041
 ...
 31     211         030         313         193   011         079         100
Totals: 6236  5886  7000  6110  6447  5872  5628  5377  4719  5656  4462  3768


Figure 14.4.1 Rank averages by month (vertical axis: average selection number).
Overall average selection number = (1 + 2 + ··· + 366)/366 = [366(367)/2]/366 = 183.5.


Let $\alpha = 0.01$. Then $H_0$ should be rejected if $b \geq \chi^2_{.99,11} = 24.725$. But b does
exceed that cutoff, implying that the lottery was not random.

An even more resounding rejection of the randomness hypothesis can be
gotten by dividing the twelve months into two half-years: the first, January
through June; the second, July through December. Then the hypotheses to be
tested are

\[ H_0: \mu_1 = \mu_2 \]
versus
\[ H_1: \mu_1 \neq \mu_2 \]

Table 14.4.3, derived from Table 14.4.2, gives the new rank sums $R_{.1}$ and
$R_{.2}$, associated with the two half-years. Substituting those values into the
Kruskal-Wallis statistic shows that the new b (with 1 degree of freedom)
is 16.85:

\[ b = \frac{12}{366(367)}\left[\frac{(37{,}551)^2}{182} + \frac{(29{,}610)^2}{184}\right] - 3(367) = 16.85 \]


Table 14.4.3 Summary of 1969 Draft Lottery by Six-Month Periods

             Jan.-June (1)   July-Dec. (2)
$R_{.j}$        37,551          29,610
$n_j$             182             184


The significance of 16.85 can be gauged by recalling the moments of a chi square
random variable. If B has a chi square pdf with 1 degree of freedom, then
E(B) = 1 and Var(B) = 2 (see Question 7.3.2). It follows, then, that the observed
b is more than 11 standard deviations away from its mean:

\[ \frac{16.85 - 1}{\sqrt{2}} = 11.2 \]

Analyzed this way, there can be little doubt that the lottery was not random!
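
Both values of b in this case study can be recomputed from the monthly rank totals alone. The Python sketch below does so; the function name is an assumption made for illustration, and the month lengths are the usual calendar lengths with February counted as 29 days, which is what the 366 birth dates imply.

```python
import math


def kruskal_wallis_from_rank_sums(rank_sums, sizes):
    """Kruskal-Wallis statistic computed from rank sums R_.j and sample sizes n_j."""
    n = sum(sizes)
    total = sum(r ** 2 / n_j for r, n_j in zip(rank_sums, sizes))
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Rank totals from the bottom row of Table 14.4.2 (January through December)
month_rank_sums = [6236, 5886, 7000, 6110, 6447, 5872,
                   5628, 5377, 4719, 5656, 4462, 3768]
# Days per month, with February = 29 (assumed from the 366 birth dates)
month_sizes = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

b_months = kruskal_wallis_from_rank_sums(month_rank_sums, month_sizes)

# Collapse to two half-years: January-June and July-December
half_rank_sums = [sum(month_rank_sums[:6]), sum(month_rank_sums[6:])]
half_sizes = [sum(month_sizes[:6]), sum(month_sizes[6:])]
b_halves = kruskal_wallis_from_rank_sums(half_rank_sums, half_sizes)

# Distance of b_halves from the mean of a chi square (1 df), in standard deviations
sd_units = (b_halves - 1) / math.sqrt(2)

print(round(b_months, 2), round(b_halves, 2), round(sd_units, 1))
```

The half-year rank sums and sample sizes produced by the sketch agree with the entries of Table 14.4.3.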


About the Data Needless to say, the way the 1969 draft lottery turned out was a 
huge embarrassment for the Selective Service Administration and a public relations 
nightmare. Many individuals, both inside and outside the government, argued that 
a “do over” was the only fair resolution. Unfortunately, any course of action would 
have inevitably angered a sizeable number of people, so the decision was made to 
stay with the original lottery, flawed though it was. 

A believable explanation for why the selections were so nonrandom is that 
(1) the birthday capsules were put into the urn by month (January capsules first, 
February capsules second, March capsules next, and so on) and (2) the capsules 
were not adequately mixed before the drawings began, leaving birthdays late in the
year disproportionately near the top of the urn. If (1) and (2) happened, the trend
in Figure 14.4.1 would be the consequence. 

What is particularly vexing about the draft lottery debacle and all the furor that 
it created is that setting up a “fair” lottery is so very easy. First, the birthday capsules 
should have been numbered from 1 to 366. Then a computer or a random number 
table should have been used to generate a random permutation of those numbers. 
That permutation would define the order in which the capsules would be put into the 
urn. If those two simple steps had been followed, the likelihood of a fiasco similar 
to that shown in Figure 14.4.1 would have been essentially zero. 
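
The two steps just described take only a few lines on a computer. The Python sketch below is one hedged way to generate such a random permutation; the function name, the optional `seed` argument, and the use of the standard library's `random` module are choices made for this illustration and are not part of the historical record.

```python
import random


def random_loading_order(n_capsules=366, seed=None):
    """Return a random permutation of the capsule numbers 1, 2, ..., n_capsules.

    The ith entry is the number of the capsule to be placed in the urn ith.
    """
    rng = random.Random(seed)                 # seed=None draws from system entropy
    capsule_numbers = list(range(1, n_capsules + 1))
    return rng.sample(capsule_numbers, k=n_capsules)


loading_order = random_loading_order()
print(loading_order[:10])   # first ten capsules to go into the urn
```

Every capsule number appears exactly once in the result, and each of the 366! possible orderings is intended to be equally likely. For an actual lottery a cryptographically secure generator, such as `random.SystemRandom`, would be the more defensible choice.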


Questions 


14.4.1. Use a Kruskal-Wallis test to analyze the teacher
expectation data described in Question 8.2.7. Let $\alpha = 0.05$.
What assumptions are being made?


14.4.2. Recall the fiddler crab data given in Question 9.5.3.
Use the Kruskal-Wallis test to compare the
times spent waving to females by the two groups of males.
Let $\alpha = 0.10$.


14.4.3. Use the Kruskal-Wallis method to test at the 
0.05 level that methylmercury metabolism is different for 
males and females in Question 9.2.8. 


14.4.4. Redo the analysis of the Quintus Curtius Snod- 
grass/Mark Twain data in Case Study 9.2.1, this time using 
a nonparametric procedure. 


14.4.5. Use the Kruskal-Wallis technique to test the 
hypothesis of Case Study 12.2.1 concerning the effect of 
smoking on heart rate. 


14.4.6. A sample of ten 40-W light bulbs was taken 
from each of three manufacturing plants. The bulbs were 
burned until failure. The number of hours that each 
remained lit is listed in the following table. 


Plant 1 Plant 2 Plant 3


905 1109 571 
1018 1155 1346 
905 835 292 
886 1152 825 
958 1036 676 
1056 926 541 
904 1029 818 
856 1040 90 
1070 959 2246 
1006 996 104 


(a) Test the hypothesis that the median lives of bulbs 
produced at the three plants are all the same. Use 
the 0.05 level of significance. 

(b) Are the mean lives of bulbs produced at the three
plants all the same? Use the analysis of variance with
$\alpha = 0.05$.


(c) Change the observation “2246” in the third column 
to “1500” and redo part (a). How does this change 
affect the hypothesis test? 

(d) Change the observation “2246” in the third column 
to “1500” and redo part (b). How does this change 
affect the hypothesis test? 


14.4.7. The production of a certain organic chemical
requires the addition of ammonium chloride. The manufacturer
can conveniently obtain the ammonium chloride
in any one of three forms: powdered, moderately ground,
and coarse. To see what effect, if any, the quality of the
NH$_4$Cl has, the manufacturer decides to run the reaction
seven times with each form of ammonium chloride.
The resulting yields (in pounds) are listed in the following
table. Compare the yields with a Kruskal-Wallis test. Let
$\alpha = 0.05$.


Organic Chemical Yields (lb)

                      Moderately
Powdered NH$_4$Cl   Ground NH$_4$Cl   Coarse NH$_4$Cl
      146                150               141
      152                144               138
      149                148               142
      161                155               146
      158                154               139
      149                150               145
      154                148               137


14.4.8. Show that the Kruskal-Wallis statistic, B, as
defined in Theorem 14.4.1 can also be written

\[ B = \sum_{j=1}^{k} \left(\frac{n - n_j}{n}\right) Z_j^2 \]

where

\[ Z_j = \frac{\dfrac{R_{.j}}{n_j} - \dfrac{n+1}{2}}{\sqrt{\dfrac{(n+1)(n - n_j)}{12 n_j}}} \]


14.5 The Friedman Test


The nonparametric analog of the analysis of variance for a randomized block design
is Friedman’s test, a procedure based on within-block ranks. Its form is similar to
that of the Kruskal-Wallis statistic, and, like its predecessor, it has approximately a
$\chi^2$ distribution when $H_0$ is true.


Theorem 14.5.1 Suppose $k\ (\geq 2)$ treatments are ranked independently within b blocks.
Let $r_{.j}$, $j = 1, 2, \ldots, k$, be the rank sum of the jth treatment. The null hypothesis that
the population medians of the k treatments are all equal is rejected at the $\alpha$ level of
significance (approximately) if

\[ g = \frac{12}{bk(k+1)} \sum_{j=1}^{k} r_{.j}^2 - 3b(k+1) \geq \chi^2_{1-\alpha,k-1} \]


Case Study 14.5.1 


Baseball rules allow a batter considerable leeway in how he is permitted to 
run from home plate to second base. Two of the possibilities are the narrow- 
angle and the wide-angle paths diagrammed in Figure 14.5.1. As a means of 
comparing the two, time trials were held involving twenty-two players (206). 
Each player ran both paths. Recorded for each runner was the time it took to 
go from a point 35 feet from home plate to a point 15 feet from second base. 
Based on those times, ranks (1 and 2) were assigned to each path for each player 
(see Table 14.5.1). 


Narrow-angle Wide-angle 


Figure 14.5.1 Batter’s path from home plate to second base. 


If $\mu_1$ and $\mu_2$ denote the true median rounding times associated with the
narrow-angle and wide-angle paths, respectively, the hypotheses to be tested are

\[ H_0: \mu_1 = \mu_2 \]
versus
\[ H_1: \mu_1 \neq \mu_2 \]

Let $\alpha = 0.05$. By Theorem 14.5.1, the Friedman statistic (under $H_0$) will have
approximately a $\chi^2_1$ distribution, and the decision rule will be

Reject $H_0$ if $g \geq 3.84$


Table 14.5.1 Times (sec) Required to Round First Base
Player Narrow-Angle Rank Wide-Angle Rank
1 5.50 1 5.55 2 
2 5.70 1 5.75 2 
3 5.60 2 5.50 1 
4 5.50 2 5.40 1 
5 5.85 2 5.70 1 
6 5.55 1 5.60 2 
7 5.40 2 5.35 1 
8 5.50 2 5.35 1 
9 5.15 2 5.00 1 
10 5.80 2 5.70 1 
11 5.20 2 5.10 1 
12 5.55 2 5.45 1 
13 5.35 1 5.45 2
14 5.00 2 4.95 1 
15 5.50 2 5.40 1 
16 5.55 2 5.50 1 
17 5.55 2 5.35 1 
18 5.50 1 5.55 2 
19 5.45 2 5.25 1 
20 5.60 2 5.40 1 
21 5.65 2 5.55 1 
22 6.30 2 6.25 1 
39 27

But

\[ g = \frac{12}{22(2)(3)}\left[(39)^2 + (27)^2\right] - 3(22)(3) = 6.54 \]

implying that the two paths are not equivalent. The wide-angle path appears to
enable runners to reach second base more quickly.
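
The within-block ranking and the resulting Friedman statistic for Table 14.5.1 can be retraced with the Python sketch below. It is offered as an illustration under stated assumptions: the function names are invented here, each player is treated as one block, and the times are copied from the table.

```python
def average_ranks(values):
    """Ranks from smallest (1) to largest; tied values share their average rank."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


def friedman_statistic(blocks):
    """Return (rank_sums, g) where each block lists the k treatment responses."""
    b, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:                       # rank separately within each block
        for j, r in enumerate(average_ranks(block)):
            rank_sums[j] += r
    g = 12 / (b * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * b * (k + 1)
    return rank_sums, g


# Times (sec) from Table 14.5.1, players 1 through 22
narrow = [5.50, 5.70, 5.60, 5.50, 5.85, 5.55, 5.40, 5.50, 5.15, 5.80, 5.20,
          5.55, 5.35, 5.00, 5.50, 5.55, 5.55, 5.50, 5.45, 5.60, 5.65, 6.30]
wide = [5.55, 5.75, 5.50, 5.40, 5.70, 5.60, 5.35, 5.35, 5.00, 5.70, 5.10,
        5.45, 5.45, 4.95, 5.40, 5.50, 5.35, 5.55, 5.25, 5.40, 5.55, 6.25]

rank_sums, g = friedman_statistic(list(zip(narrow, wide)))
print(rank_sums, round(g, 3))
```

The sketch reproduces the rank sums 39 and 27; carried to three decimals the statistic is 6.545, which the text reports as 6.54 and which exceeds the 3.84 cutoff.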


Questions 


14.5.1. The following data come from a field trial set up 
to assess the effects of different amounts of potash on the 
breaking strength of cotton fibers (25). The experiment 
was done in three blocks. The five treatment levels (36,
54, 72, 108, and 144 lbs of potash per acre) were assigned
randomly within each block. The variable recorded was
the Pressley strength index. Compare the effects of the different
levels of potash applications using Friedman’s test.
Let $\alpha = 0.05$.


Pressley Strength Index for Cotton Fibers

            Treatment (pounds of potash/acre)
Block      36      54      72      108     144
  1        7.62    8.14    7.76    7.17    7.46
  2        8.00    8.15    7.73    7.57    7.68
  3        7.93    7.87    7.74    7.80    7.21


14.5.2. Use Friedman’s test to analyze the Transylvania 
effect data given in Case Study 13.2.3. 


14.5.3. Until its recent indictment as a possible carcinogen,
cyclamate was a widely used sweetener in
soft drinks. The following data show a comparison of
three laboratory methods for determining the percentage
of sodium cyclamate in commercially produced orange
drink. All three procedures were applied to each of twelve
samples (156).


Percent Sodium Cyclamate (w/w)

                       Method
Sample   Picryl Chloride   Davies   AOAC
  1          0.598         0.628    0.632
  2          0.614         0.628    0.630
  3          0.600         0.600    0.622
  4          0.580         0.612    0.584
  5          0.596         0.600    0.650
  6          0.592         0.628    0.606
  7          0.616         0.628    0.644
  8          0.614         0.644    0.644
  9          0.604         0.644    0.624
 10          0.608         0.612    0.619
 11          0.602         0.628    0.632
 12          0.614         0.644    0.616


Use Friedman’s test to determine whether the differences
from method to method are statistically significant. Let
$\alpha = 0.05$.


14.5.4. Use Friedman’s test to compare the effects of
habitat density on cockroach aggression for the data given
in Question 8.2.4. Let $\alpha = 0.05$. Would the conclusion be
any different if the densities were compared using the
analysis of variance?


14.5.5. Compare the acrophobia therapies described in
Case Study 13.2.1 using the Friedman test. Let $\alpha = 0.01$.
Does your conclusion agree with the inference reached
using the analysis of variance?


14.5.6. Suppose that k treatments are to be applied within
each of b blocks. Let $\bar{r}_{..}$ denote the average of the bk ranks
and let $\bar{r}_{.j} = (1/b) r_{.j}$. Show that the Friedman statistic
given in Theorem 14.5.1 can also be written

\[ g = \frac{12b}{k(k+1)} \sum_{j=1}^{k} (\bar{r}_{.j} - \bar{r}_{..})^2 \]


What analysis of variance expression does this resemble? 


14.6 Testing for Randomness 


All hypothesis tests, parametric as well as nonparametric, make the implicit assumption
that the observations comprising a given sample are random, meaning that
the value of $y_i$ does not predispose the value of $y_j$. Should that not be the case,
identifying the source of the nonrandomness (and doing whatever it takes to eliminate
it from future observations) necessarily becomes the experimenter’s first
objective.


Examples of nonrandomness are not uncommon in industrial settings, where 
successive measurements made on a particular piece of equipment may show 
a trend, for example, if the machine is slowly slipping out of calibration. The 
other extreme—where measurements show a nonrandom alternating pattern (high 
value, low value, high value, low value, ...)—can occur if successive measurements 
are made by two different operators, whose standards or abilities are markedly 
different, or, perhaps, by one operator using two different machines. 

A variety of tests based on runs of one sort or another can be used to examine 
the randomness of a sequence of measurements. One of the most useful is a test 
based on the total number of “runs up and down.” 


Suppose that y, yo,... 


, Yn denotes a set of n time-ordered measurements. Let 


sgn(y; — y;-1) denote the algebraic sign of the difference y; — y;_1. (It will be assumed 
that the y;’s represent a continuous random variable, so the probability of y; and y;_ 
being equal is zero.) The n observations, then, produce an ordered arrangement 
of n — 1 pluses and/or minuses representing the signs of the differences between 
consecutive measurements (see Figure 14.6.1). 


Figure 14.6.1 Data: yy, 


SS 
sgn(y2— y1) 


2 Y3 +++ Yn—-1 Yn 
Se 


sgn (y3 — y2) Sgn (Yn — Yn—1) 


For example, the n = 5 observations

14.2  10.6  11.2  12.1  9.3

generate the "sgn" sequence

-  +  +  -

which corresponds to an initial run down (that is, going from 14.2 to 10.6), followed
by a single run up made of two consecutive increases (10.6 to 11.2 to 12.1), and ending
with a final run down.

Let W denote the total number of runs up and down, as reflected by the
sequence $\mathrm{sgn}(y_2 - y_1), \mathrm{sgn}(y_3 - y_2), \ldots, \mathrm{sgn}(y_n - y_{n-1})$. For the example just cited,
W = 3. In general, if W is too large or too small, it can be concluded that the
$y_i$'s are not random. The appropriate decision rule derives from an approximate
Z ratio.


Theorem 14.6.1

Let W denote the number of runs up and down in a sequence of n observations, where
n > 2. If the sequence is random, then

a. $E(W) = \dfrac{2n - 1}{3}$

b. $\mathrm{Var}(W) = \dfrac{16n - 29}{90}$

and

c. $\dfrac{W - E(W)}{\sqrt{\mathrm{Var}(W)}} \doteq Z$, when n > 20.


Proof See (125) and (204). 
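
Parts (a) and (b) of Theorem 14.6.1 can be checked for small n by brute force. The sketch below is not part of the cited proofs; it simply lists every ordering of n distinct values (each equally likely under randomness), counts the runs up and down in each, and compares the exact mean and variance with the theorem's formulas. The function names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations


def runs_up_down(seq):
    """Count the runs up and down in a sequence with no tied neighbors."""
    signs = [1 if b > a else -1 for a, b in zip(seq, seq[1:])]
    # A new run starts at the first sign and at every change of sign.
    return 1 + sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def exact_moments(n):
    """Exact E(W) and Var(W) over all n! equally likely orderings."""
    counts = [runs_up_down(p) for p in permutations(range(n))]
    total = len(counts)
    mean = Fraction(sum(counts), total)
    var = Fraction(sum(c * c for c in counts), total) - mean ** 2
    return mean, var


# The n = 5 example from the text.
print(runs_up_down([14.2, 10.6, 11.2, 12.1, 9.3]))

# Enumerated moments next to (2n - 1)/3 and (16n - 29)/90.
for n in range(4, 8):
    mean, var = exact_moments(n)
    print(n, mean, Fraction(2 * n - 1, 3), var, Fraction(16 * n - 29, 90))
```

Enumeration is practical only for small n because the number of orderings grows as n!; for samples of realistic size, the normal approximation in part (c) is what gets used.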


Case Study 14.6.1 


The first widespread labor dispute in the United States occurred in 1877. Rail- 
roads were the target, and workers were idled from Pittsburgh to San Francisco. 
That initial confrontation may have been a long time coming, but organiz- 
ers were quick to recognize what a powerful weapon a work stoppage could 
be —36,757 more strikes were called between 1881 and 1905! 

For that twenty-five-year period, Table 14.6.1 shows the annual numbers of 
strikes that were called and the percentages that were deemed successful (31). 
By definition, a strike was considered “successful” if most or all of the workers’ 
demands were met. 

An obvious question suggested by the nature of these data is whether the 
workers’ successes from year to year were random. One plausible hypothe- 
sis would be that the percentages of successful strikes should show a trend 
and tend to increase, as unions acquired more and more power. On the other 
hand, it could be argued that years of high success rates might tend to alter- 
nate with years of low success rates, indicating a kind of labor and management 
standoff. Still another hypothesis, of course, would be that the percentages show 
no patterns whatsoever and qualify as a random sequence. 

The last column shows the calculation of sgn(y; — y;-1) for i =2,3,..., 25. 
By inspection, the number of runs up and down in that sequence of pluses and 
minuses is eighteen. To test 




Table 14.6.1

Year   Number of Strikes   % Successful, y_i   sgn(y_i - y_{i-1})
1881          451                 61
1882          454                 53                   -
1883          478                 58                   +
1884          443                 51                   -
1885          645                 52                   +
1886         1432                 34                   -
1887         1436                 45                   +
1888          906                 52                   +
1889         1075                 46                   -
1890         1833                 52                   +
1891         1717                 37                   -
1892         1298                 39                   +
1893         1305                 50                   +
1894         1349                 38                   -
1895         1215                 55                   +
1896         1026                 59                   +
1897         1078                 57                   -
1898         1056                 64                   +
1899         1797                 73                   +
1900         1779                 46                   -
1901         2924                 48                   +
1902         3161                 47                   -
1903         3494                 40                   -
1904         2307                 35                   -
1905         2077                 40                   +

Number of runs up and down: w = 18


$H_0$: The $y_i$'s are random with respect to the number of runs up and down
versus
$H_1$: The $y_i$'s are not random with respect to the number of runs up and down


at the $\alpha = 0.05$ level of significance, we should reject the null hypothesis if
$\dfrac{w - E(W)}{\sqrt{\mathrm{Var}(W)}}$ is either (1) $\le -z_{\alpha/2} = -1.96$ or (2) $\ge z_{\alpha/2} = 1.96$. Given that n = 25,

$$E(W) = \frac{2(25) - 1}{3} = 16.3$$

and

$$\mathrm{Var}(W) = \frac{16(25) - 29}{90} = 4.12$$

so the observed test statistic is +0.84:

$$z = \frac{18 - 16.3}{\sqrt{4.12}} = 0.84$$


Our conclusion, then, is to fail to reject $H_0$; it is believable, in other words, that
the observed sequence of runs up and down could, in fact, have come from a
sample of twenty-five random observations.
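
The calculation in Case Study 14.6.1 can be reproduced with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes that no two consecutive observations are equal. Because it carries full precision instead of rounding E(W) to 16.3, it gives z of about 0.82 rather than 0.84; the conclusion is unchanged.

```python
import math

# % successful strikes, 1881-1905, from Table 14.6.1
pct_successful = [61, 53, 58, 51, 52, 34, 45, 52, 46, 52, 37, 39, 50,
                  38, 55, 59, 57, 64, 73, 46, 48, 47, 40, 35, 40]


def runs_up_down_test(y):
    """Return (w, E(W), Var(W), z) for the runs-up-and-down test."""
    n = len(y)
    # Sign of each successive difference; assumes no tied neighbors.
    signs = [1 if b > a else -1 for a, b in zip(y, y[1:])]
    w = 1 + sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    mean = (2 * n - 1) / 3          # Theorem 14.6.1(a)
    var = (16 * n - 29) / 90        # Theorem 14.6.1(b)
    z = (w - mean) / math.sqrt(var)
    return w, mean, var, z


w, mean, var, z = runs_up_down_test(pct_successful)
reject = abs(z) >= 1.96             # two-sided test at alpha = 0.05
print(w, round(mean, 1), round(var, 2), round(z, 2), reject)
```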
About the Data Another hypothesis suggested by these data is that the
percentage of successful strikes might vary inversely with the number of strikes: As
the latter increased, the number of "frivolous" disputes might also have increased,
which could understandably lead to a lower percentage of successful resolutions. In
point of fact, that explanation does appear to have some merit. A linear fit of the
twenty-five observations yields the equation


% successful = 56.17 - 0.0047 · (number of strikes)


and the null hypothesis $H_0$: $\beta_1 = 0$ is rejected at the $\alpha = 0.05$ level of significance.
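
The fitted line quoted above can be recovered from Table 14.6.1 with the usual least squares formulas. The sketch below uses plain Python, with the two data columns typed in from the table and a helper name chosen for illustration. It reproduces only the intercept and slope, not the test of $H_0$: $\beta_1 = 0$.

```python
# Columns of Table 14.6.1 (1881-1905)
strikes = [451, 454, 478, 443, 645, 1432, 1436, 906, 1075, 1833, 1717, 1298,
           1305, 1349, 1215, 1026, 1078, 1056, 1797, 1779, 2924, 3161, 3494,
           2307, 2077]
pct_successful = [61, 53, 58, 51, 52, 34, 45, 52, 46, 52, 37, 39, 50,
                  38, 55, 59, 57, 64, 73, 46, 48, 47, 40, 35, 40]


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = a + b x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = least_squares(strikes, pct_successful)
print(round(intercept, 2), round(slope, 4))
```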


Questions 


14.6.1. The data in the table examine the relationship 
between stock market changes (1) during the first few 
days in January and (2) over the course of the entire year. 
Included are the years from 1950 through 1986. 


(a) Use Theorem 14.6.1 to test the randomness of the
January changes (relative to the number of runs up
and down). Let $\alpha = 0.05$.

(b) Use Theorem 14.6.1 to test the randomness of the
annual changes. Let $\alpha = 0.05$.


        % Change for
        First 5 Days in    % Change
Year    Jan., x            for Year, y
1950        2.0               21.8
1951        2.3               16.5
1952        0.6               11.8
1953       -0.9               -6.6
1954        0.5               45.0
1955       -1.8               26.4
1956       -2.1                2.6
1957       -0.9              -14.3
1958        2.5               38.1
1959        0.3                8.5
1960       -0.7               -3.0
1961        1.2               23.1
1962       -3.4              -11.8
1963        2.6               18.9
1964        1.3               13.0
1965        0.7                9.1
1966        0.8              -13.1
1967        3.1               20.1
1968        0.2                7.7
1969       -2.9              -11.4
1970        0.7                0.1
1971        0.0               10.8
1972        1.4               15.6
1973        1.5              -17.4
1974       -1.5              -29.7
1975        2.2               31.5
1976        4.9               19.1
1977       -2.3              -11.5
1978       -4.6                1.1
1979        2.8               12.3
1980        0.9               25.8
1981       -2.0               -9.7
1982       -2.4               14.8
1983        3.2               17.3
1984        2.4                1.4
1985       -1.9               26.3
1986       -1.6               14.6


14.6.2. Listed below for two consecutive fiscal years 
are the monthly numbers of passenger boardings at a 
Florida airport. Use Theorem 14.6.1 to test whether these 
twenty-four observations can be considered a random 
sequence, relative to the number of runs up and down. 
Let $\alpha = 0.05$.


Passenger Passenger 
Month Boardings Month Boardings 
July 41,388 July 44,148 
Aug. 44,880 Aug. 42,038 
Sept. 33,556 Sept. 35,157 
Oct. 34,805 Oct. 39,568 
Nov. 33,025 Nov. 34,185 
Dec. 34,873 Dec. 37,604 
Jan. 31,330 Jan. 28,231 
Feb. 30,954 Feb. 29,109 
March 32,402 March 38,080 
April 38,020 April 34,184 
May 42,828 May 39,842 
June 41,204 June 46,727 


14.6.3. Below is a partial statistical summary
of the first twenty-four Super Bowls (33). Of particular
interest to advertisers is the network share that each
game garnered. Can those shares be considered a random
sequence, relative to the number of runs up and down?
Test the appropriate hypothesis at the $\alpha = 0.05$ level of
significance.
Game   Year  Winner (score)            Loser (score)            MVP Is QB  Network Share (network)
I      1967  Green Bay (NFL) 35        Kansas City (AFL) 10         1      79 (CBS/NBC combined)
II     1968  Green Bay (NFL) 33        Oakland (AFL) 14             1      68 (CBS)
III    1969  NY Jets (AFL) 16          Baltimore (NFL) 7            1      71 (NBC)
IV     1970  Kansas City (AFL) 23      Minnesota (NFL) 7            1      69 (CBS)
V      1971  Baltimore (AFC) 16        Dallas (NFC) 13              0      75 (NBC)
VI     1972  Dallas (NFC) 24           Miami (AFC) 3                1      74 (CBS)
VII    1973  Miami (AFC) 14            Washington (NFC) 7           0      72 (NBC)
VIII   1974  Miami (AFC) 24            Minnesota (NFC) 7            0      73 (CBS)
IX     1975  Pittsburgh (AFC) 16       Minnesota (NFC) 6            0      72 (NBC)
X      1976  Pittsburgh (AFC) 21       Dallas (NFC) 17              0      78 (CBS)
XI     1977  Oakland (AFC) 32          Minnesota (NFC) 14           0      73 (NBC)
XII    1978  Dallas (NFC) 27           Denver (AFC) 10              0      67 (CBS)
XIII   1979  Pittsburgh (AFC) 35       Dallas (NFC) 31              1      74 (NBC)
XIV    1980  Pittsburgh (AFC) 31       Los Angeles (AFC) 19         1      67 (CBS)
XV     1981  Oakland (AFC) 27          Philadelphia (NFC) 10        1      63 (NBC)
XVI    1982  San Francisco (NFC) 26    Cincinnati (AFC) 21          1      73 (CBS)
XVII   1983  Washington (NFC) 27       Miami (AFC) 17               0      69 (NBC)
XVIII  1984  LA Raiders (AFC) 38       Washington (NFC) 9           0      71 (CBS)
XIX    1985  San Francisco (NFC) 38    Miami (AFC) 16               1      63 (ABC)
XX     1986  Chicago (NFC) 46          New England (AFC) 10         0      70 (NBC)
XXI    1987  NY Giants (NFC) 39        Denver (AFC) 20              1      66 (CBS)
XXII   1988  Washington (NFC) 42       Denver (AFC) 10              1      62 (ABC)
XXIII  1989  San Francisco (NFC) 20    Cincinnati (AFC) 16          0      68 (NBC)
XXIV   1990  San Francisco (NFC) 55    Denver (AFC) 10              1      63 (CBS)


14.6.4. Shown below are the lengths (in mm) of
furniture dowels recorded as part of an ongoing quality-control
program. Listed are the measurements made on
thirty samples (each of size 4) taken in order from the
assembly line. Is the variation in the sample averages random
with respect to the number of runs up and down?
Do an appropriate hypothesis test at the $\alpha = 0.05$ level of
significance.


Sample   y1     y2     y3     y4     Average
  1     46.1   44.4   45.3   44.2    45.0
  2     46.0   45.4   42.5   44.4    44.6
  3     44.3   44.0   45.4   43.9    44.4
  4     44.9   43.7   45.2   44.8    44.7
  5     43.0   45.3   45.9   43.8    44.5
  6     46.0   43.2   44.4   43.7    44.3
  7     46.0   44.6   45.4   46.4    45.6
  8     46.1   45.5   45.0   45.5    45.5
  9     42.8   45.1   44.9   44.3    44.3
 10     45.0   46.7   43.0   44.8    44.9
 11     45.5   44.5   45.1   47.1    45.6
 12     45.8   44.6   44.8   45.1    45.1
 13     45.1   45.4   46.0   45.4    45.5
 14     44.6   43.8   44.2   43.9    44.1
 15     44.8   45.5   45.2   46.2    45.4
 16     45.8   44.1   43.3   45.8    44.8
 17     44.1   44.8   46.1   45.5    45.1
 18     44.5   43.6   45.1   46.9    45.0
 19     45.2   43.1   46.3   46.4    45.3
 20     45.9   46.8   46.8   45.8    46.3
 21     44.0   44.7   46.2   45.4    45.1
 22     43.4   44.6   45.4   44.4    44.5
 23     43.1   44.6   44.5   45.8    44.5
 24     46.6   43.3   45.1   44.2    44.8
 25     46.2   44.9   45.3   46.0    45.6
 26     42.5   43.4   44.3   42.7    43.2
 27     43.4   43.3   43.4   43.5    43.4
 28     42.3   42.4   46.6   42.3    43.4
 29     41.9   42.9   42.0   42.9    42.4
 30     43.2   43.5   42.2   44.7    43.4


14.6.5. Listed below are forty ordered observations generated
by Minitab's RANDOM command that presumably
represent a normal distribution with $\mu = 5$ and $\sigma = 2$.
Can the sample be considered random with respect to the
number of runs up and down?


Obs. #   y_i      Obs. #   y_i      Obs. #   y_i       Obs. #   y_i
  1     7.0680     11     7.6979     21      5.9828     31     5.2625
  2     4.0540     12     4.4338     22      1.4614     32     5.9047
  3     6.6165     13     5.6538     23      9.2655     33     4.6342
  4     1.2166     14     8.0791     24      4.9281     34     5.3089
  5     4.6158     15     4.7458     25     10.5561     35     5.4942
  6     7.7540     16     3.5044     26      6.1738     36     6.6914
  7     7.7300     17     1.3071     27      5.4895     37     1.4380
  8     6.5109     18     5.7893     28      3.6629     38     8.2604
  9     3.8933     19     4.5241     29      3.7223     39     5.0209
 10     2.7533     20     5.3291     30      3.5211     40     0.5544


14.6.6. Sunnydale Farms markets an all-purpose fertilizer
that is supposed to contain, by weight, 15% potash ($K_2O$).
Samples were taken daily in October from three bags
chosen at random as they came off the filling machine.
Tabulated below are the $K_2O$ percentages recorded.
Calculate the range (= $y_{max} - y_{min}$) for each sample. Use
Theorem 14.6.1 to test whether the variation in the ranges
can be considered random with respect to the number of
runs up and down.
Date    y1     y2     y3      Date    y1     y2     y3
10/1    16.1   14.4   15.3    10/15   16.3   13.3   15.3
10/2    16.0   16.4   13.5    10/16   17.4   13.8   14.3
10/3    14.3   14.0   15.4    10/17   13.5   11.0   15.4
10/4    14.8   13.1   15.2    10/18   15.6    9.2   18.9
10/5    12.0   15.4   16.4    10/19   16.3   17.6   20.5
10/8    16.4   12.3   14.2    10/22   14.3   15.6   17.0
10/9    16.9   14.2   15.8    10/23   15.4   15.3   15.4
10/10   17.2   16.0   14.9    10/24   14.3   14.4   18.6
10/11   10.6   15.3   14.9    10/25   13.9   14.9   14.0
10/12   15.0   19.2   10.0    10/26   15.2   15.5   14.2


14.7 Taking a Second Look at Statistics (Comparing 
Parametric and Nonparametric Procedures) 


Virtually every parametric hypothesis test an experimenter might consider doing has 
one or more nonparametric analogs. Using two independent samples to compare 
the locations of two distributions, for example, can be done with a two-sample t test
or with a Wilcoxon signed rank test. Likewise, comparing k treatment levels using
dependent samples can be accomplished with the (parametric) analysis of variance 
or with the (nonparametric) Friedman’s test. Having alternative ways to analyze 
the same set of data inevitably raises the same sorts of questions that surfaced in 
Section 13.4—which procedure should be used in a given situation, and why? 

The answers to those questions are rooted in the origins of the data—that is, in 
the pdfs generating the samples—and what those origins imply about (1) the rela- 
tive power of the parametric and nonparametric procedures and (2) the robustness 
of the two procedures. As we have seen, parametric procedures make assumptions 
about the origin of the data that are much more specific than the assumptions 
made by nonparametric procedures. The (pooled) two-sample t test, for example, 
assumes that the two sets of independent observations come from normal distri- 
butions with the same standard deviation. The Wilcoxon signed rank test, on the 
other hand, makes the much weaker assumption that the observations come from 
symmetric distributions (which, of course, include normal distributions as a special 
case). Moreover, each observation does not have to come from the same symmetric 
distribution. 

In general, if the assumptions made by a parametric test are satisfied, then that 
procedure will be superior to any of its nonparametric analogs in the sense that 
its power curve will be steeper. (Recall Figure 6.4.5—if the normality assumption is 
met, the parametric procedure will have a power curve similar to that for Method B; 
the nonparametric procedure would have a power curve similar to Method A’s.) 

If one or more of the parametric procedure’s assumptions are not satisfied, 
the distribution of its test statistic will not be exactly what it would have been 
had the assumptions all been met (recall Figure 7.4.5). If the differences between 
the “theoretical” test statistic distribution and the “actual” test statistic distribution 
are considerable, the integrity of the parametric procedure is obviously compro- 
mised. Whether those two distributions will be considerably different depends on 
the robustness of the parametric procedure with respect to whichever assumptions 
are being violated. 


Concluding this section is a set of Monte Carlo simulations that compare the 
one-way analysis of variance to a Kruskal-Wallis test. In each instance, the data
consist of $n_j = 5$ observations taken on each of k = 4 treatment levels. Included are
simulations that focus on (1) the power of the two procedures when the normality 
assumption is met and (2) the robustness of the two tests when neither the normality 
nor the symmetry assumptions are satisfied. Each simulation is based on one hun- 
dred replications, and the twenty observations generated for each replication (by 
the RANDOM command) were analyzed twice, once using the analysis of variance 
and again using the Kruskal-Wallis test (see Appendix 14.A.1 for an example of the 
Minitab syntax). 

Figure 14.7.1 shows the distribution of the one hundred observed F ratios
when all the $H_0$ assumptions made by the analysis of variance are satisfied; that
is, five observations were taken on each of four treatment levels, where all twenty
observations were normally distributed with the same mean and the same standard
deviation. Given that $n_j = 5$, k = 4, and n = 20, there would be 3 df for Treatments
and 16 df for Error (recall Figure 12.2.1). Superimposed over the histogram is the
pdf for an $F_{3,16}$ random variable. Clearly, the agreement between the F curve and
the histogram is excellent.


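[Figure 14.7.1: histogram of the one hundred observed F ratios with the $F_{3,16}$ pdf
superimposed; horizontal axis: observed F ratio (0 to 5); vertical axis: density]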


Figure 14.7.2 is the analogous "H" distribution for the Kruskal-Wallis test. The
one hundred data sets analyzed were the same that gave rise to Figure 14.7.1.
Superimposed is the $\chi^2_3$ pdf. As predicted by Theorem 14.4.1, the distribution of observed
b values is approximated very nicely by the chi square curve with 3 (= k - 1) df.

One of the advantages of nonparametric procedures is that violations of their
assumptions tend to have relatively mild repercussions on the distributions of their
test statistics. Figure 14.7.3 is a case in point. Shown there is a histogram of
Kruskal-Wallis values calculated from one hundred data sets where each of the twenty
observations ($n_j = 5$ and k = 4) came from an exponential pdf with $\lambda = 1$, that is,
from $f_Y(y) = e^{-y}$, $y > 0$. The latter is a sharply skewed pdf that violates the
symmetry assumption underlying the Kruskal-Wallis test. The actual distribution of b
values, though, does not appear to be much different from the values produced in
Figure 14.7.2, where all the assumptions of the Kruskal-Wallis test were met.

A similar insensitivity to the data's underlying pdf is not entirely shared by the
F test. Figure 14.7.4 summarizes the results of applying the analysis of variance to
the same set of one hundred replications that produced Figure 14.7.3. Notice that a
handful of the data sets yielded F ratios much larger than the $F_{3,16}$ curve would have
predicted. Recall that a similar skewness was observed when the t test was applied
to exponential data where n was small (see Figure 7.4.6b).

[Figures 14.7.2, 14.7.3, and 14.7.4: histograms of the simulated test statistics;
recoverable axis labels: Density (vertical), Observed B (horizontal)]
Having weaker assumptions and being less sensitive to violations of those
assumptions are definite advantages that nonparametric procedures often have over
their parametric counterparts. But that broader range of applicability does not come
without a price: Nonparametric hypothesis tests will make Type II errors more often
than will parametric procedures when the assumptions of the parametric procedures
are satisfied.

Consider, for example, the two Monte Carlo simulations pictured in
Figures 14.7.5 and 14.7.6. The former shows the results of applying the
Kruskal-Wallis test to one hundred sets of k-sample data, where the five measurements
representing each of the first three treatment levels came from normal distributions
with $\mu = 0$ and $\sigma = 1$, while the five measurements representing the fourth treatment
level came from a normal distribution with $\mu = 1$ and $\sigma = 1$. As expected, the
distribution of observed b values has shifted to the right, compared to the $H_0$ distribution
shown in Figure 14.7.3. More specifically, 26% of the one hundred data sets
produced Kruskal-Wallis values in excess of 7.815 (= $\chi^2_{.95,3}$), meaning that $H_0$ would
have been rejected at the $\alpha = 0.05$ level of significance. [If $H_0$ were true, of course,
the theoretical percentage of b values exceeding 7.815 would be 5%. Only 1% of
the data sets, though, exceeded the $\alpha = 0.01$ cutoff (= $\chi^2_{.99,3}$ = 11.345), which is the
same percentage expected if $H_0$ were true.]

Figure 14.7.6 shows the results of doing the analysis of variance on the same
one hundred data sets used for Figure 14.7.5. As was true for the Kruskal-Wallis
calculations, the distribution of observed F ratios has shifted to the right (compare
Figure 14.7.6 to Figure 14.7.1). What is especially noteworthy, though, is that the
observed F ratios have shifted much further to the right than did the observed b
values. For example, while only 1% of the observed b values exceeded the $\alpha = 0.01$
cutoff (= 11.345), a total of 8% of the observed F ratios exceeded their $\alpha = 0.01$
cutoff (= $F_{.99,3,16}$).
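
A simulation in the spirit of the ones just described can be sketched in Python. This is not the Minitab code behind Figures 14.7.1 through 14.7.6. It uses Python's `random` module in place of the RANDOM command, an arbitrary seed, and hypothetical helper names, so its rejection percentages will differ from those quoted in the text. The 5% cutoff for $F_{3,16}$ (3.24) is taken from a standard F table; the chi square cutoff 7.815 is the one quoted above.

```python
import random

CHI2_95_3 = 7.815   # chi square cutoff quoted in the text
F_95_3_16 = 3.24    # upper 5% point of F(3, 16), from a standard F table


def f_ratio(groups):
    """One-way analysis of variance F statistic."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n
    sstr = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    sse = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    return (sstr / (k - 1)) / (sse / (n - k))


def kruskal_wallis(groups):
    """Kruskal-Wallis statistic (continuous data assumed, so no ties)."""
    pooled = sorted(y for g in groups for y in g)
    rank = {y: i + 1 for i, y in enumerate(pooled)}
    n = len(pooled)
    total = sum(sum(rank[y] for y in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def simulate(means, reps=100, n_per_level=5, seed=1):
    """Fraction of replications in which each test rejects H0 at alpha = 0.05."""
    rng = random.Random(seed)
    f_rej = kw_rej = 0
    for _ in range(reps):
        groups = [[rng.gauss(mu, 1.0) for _ in range(n_per_level)]
                  for mu in means]
        f_rej += f_ratio(groups) >= F_95_3_16
        kw_rej += kruskal_wallis(groups) >= CHI2_95_3
    return f_rej / reps, kw_rej / reps


print(simulate([0, 0, 0, 0]))   # H0 true: all four means equal
print(simulate([0, 0, 0, 1]))   # fourth treatment level shifted by one sigma
```

With only one hundred replications the estimated rejection rates are quite noisy; raising `reps` makes the power gap between the two procedures easier to see.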

So, is there an easy answer to the question of which type of procedure to use, 
parametric or nonparametric? Sometimes yes, sometimes no. If it seems reasonable 
to believe that all the assumptions of the parametric test are satisfied, then the para- 
metric test should be used. For all those situations, though, where the validity of 
one or more of the parametric assumptions is in question, the choice becomes more 
problematic. If the violation of the assumptions is minimal (or if the sample sizes are 
fairly large), the robustness of the parametric procedures (along with their greater 
power) usually gives them the edge. Nonparametric tests tend to be reserved for 
situations where (1) sample sizes are small, and (2) there is reason to believe that 
“something” about the data is markedly inconsistent with the assumptions implicit 
in the available parametric procedures. 
Appendix 14.A.1 Minitab Applications


The Sign Test 
Figure 14.A.1.1 shows Minitab’s sign test routine applied to ten paired samples— (97, 
113), (106, 113), ... , (96, 126). The basic command is 


MTB > stest 0.0 c3; 
SUBC > alternative 0. 


where c3 contains the within-pair differences. The subcommand ALTERNATIVE
0 makes $H_1$ two-sided. One-sided alternative hypotheses require that
ALTERNATIVE 1 (if the rejection region is to the right) or ALTERNATIVE -1 (if the
rejection region is to the left) be used.


Figure 14.A.1.1

MTB > set c1
DATA > 97 106 106 95 102 111 115 104 90 96
DATA > end
MTB > set c2
DATA > 113 113 101 119 111 122 121 106 110 126
DATA > end
MTB > let c3 = c2 - c1
MTB > stest 0.0 c3;
SUBC > alternative 0.

Sign Test for Median: C3

Sign test of median = 0.00000 versus not = 0.00000

      N  Below  Equal  Above       P  Median
C3   10      1      0      9  0.0215   10.00
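
For readers without Minitab, the same two-sided sign test can be sketched in Python using exact binomial probabilities. The function name is illustrative, and dropping observations equal to the hypothesized median is an assumption of this sketch (no such observations occur in these data).

```python
from math import comb

c1 = [97, 106, 106, 95, 102, 111, 115, 104, 90, 96]
c2 = [113, 113, 101, 119, 111, 122, 121, 106, 110, 126]


def sign_test(diffs, median0=0.0):
    """Two-sided exact sign test of H0: median = median0."""
    below = sum(d < median0 for d in diffs)
    above = sum(d > median0 for d in diffs)
    n = below + above               # values equal to median0 are dropped
    # P(X <= smaller count) for X ~ binomial(n, 1/2), doubled for two sides.
    tail = sum(comb(n, i) for i in range(min(below, above) + 1)) / 2 ** n
    return below, above, min(1.0, 2 * tail)


diffs = [b - a for a, b in zip(c1, c2)]   # same as "let c3 = c2 - c1"
below, above, p_value = sign_test(diffs)
print(below, above, round(p_value, 4))
```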


The Wilcoxon Signed Rank Test 


The Wilcoxon signed rank statistic of Theorem 14.3.2 is calculated using the
command MTB > wtest $\mu_0$ c1, where the $y_i$'s have been entered in c1. As with the
sign test, the subcommand ALTERNATIVE 0 makes $H_1$ two-sided. Figure 14.A.1.2
summarizes Minitab's analysis of the shark data from Case Study 14.3.1.
Figure 14.A.1.2

MTB > set c1
DATA > 13.32 13.06 14.02 11.86 13.58 13.77 13.51 14.42 14.44 15.43
DATA > end
MTB > wtest 14.6 c1;
SUBC > alternative 0.

Wilcoxon Signed Rank Test: C1

Test of median = 14.60 versus median not = 14.60

          N for   Wilcoxon          Estimated
      N    Test  Statistic      P      Median
C1   10      10        4.5  0.022       13.75
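
The statistic 4.5 in the output above can be reproduced with a short Python sketch. The helper name is hypothetical, the differences are rounded before ranking so that the two values at distance 0.83 from 14.6 are treated as tied, and the z ratio uses a normal approximation with a continuity correction and no tie adjustment. That approximation is an assumption about how the printed P-value might be obtained, not a statement of Minitab's algorithm.

```python
import math

y = [13.32, 13.06, 14.02, 11.86, 13.58, 13.77, 13.51, 14.42, 14.44, 15.43]


def signed_rank(data, median0):
    """Sum of the ranks of the positive differences, with midranks for ties."""
    # Rounding guards against floating-point noise when detecting ties.
    diffs = [round(v - median0, 10) for v in data if v != median0]
    ordered = sorted(abs(d) for d in diffs)
    midrank = {}
    for a in set(ordered):
        first = ordered.index(a) + 1            # smallest rank held by a
        count = ordered.count(a)
        midrank[a] = first + (count - 1) / 2    # average of the tied ranks
    w = sum(midrank[abs(d)] for d in diffs if d > 0)
    return w, len(diffs)


w, n = signed_rank(y, 14.6)
# Normal approximation with a continuity correction (no tie adjustment).
z = (w + 0.5 - n * (n + 1) / 4) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
print(w, round(z, 2))
```

Looking up the resulting z in Table A.1 and doubling the tail area gives a two-sided P-value close to the 0.022 that Minitab reports.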


The Kruskal-Wallis Test 


Data are entered for the Kruskal-Wallis test using the stacked format seen earlier 
in connection with the randomized block analysis of variance in Chapter 13. The 
syntax, though, is different. First, the data from each treatment level are entered 
in a separate column. Then a stack command is used to transfer all those data to a 
single column (in this case, c5). Finally, an additional column—here, c6—is defined 
that identifies the treatment level represented by each data point in the stacked 
column. 

Figure 14.A.1.3 shows the Kruskal-Wallis input and output for the heart rate 
data given in Case Study 12.2.1. 


MTB > set c1
DATA > 69 52 71 58 59 65
DATA > end
MTB > set c2
DATA > 55 60 78 58 62 66
DATA > end
MTB > set c3
DATA > 66 81 70 77 57 79
DATA > end
MTB > set c4
DATA > 91 72 81 67 95 84
DATA > end
MTB > stack c1 c2 c3 c4 c5
MTB > set c6
DATA > 6(1) 6(2) 6(3) 6(4)
DATA > end
MTB > kruskal-wallis c5 c6.

Kruskal-Wallis Test: C5 versus C6

Kruskal-Wallis Test on C5

C6        N  Median  Ave. Rank      Z
1         6   62.00        8.1  -1.77
2         6   61.00        8.3  -1.67
3         6   73.50       14.0   0.60
4         6   82.50       19.6   2.83
Overall  24               12.5

H = 10.71  DF = 3  P = 0.013
H = 10.73  DF = 3  P = 0.013 (adjusted for ties)
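
The two H values in the output can be checked by hand or with the Python sketch below. It ranks the pooled heart rate data with midranks for tied values and then divides by the usual tie correction factor, $1 - \sum (t^3 - t)/(n^3 - n)$. The function name is illustrative; the sketch does not compute the P-value.

```python
groups = [
    [69, 52, 71, 58, 59, 65],
    [55, 60, 78, 58, 62, 66],
    [66, 81, 70, 77, 57, 79],
    [91, 72, 81, 67, 95, 84],
]


def kruskal_wallis(samples):
    """Return (H, H adjusted for ties), ranking pooled data with midranks."""
    pooled = sorted(v for s in samples for v in s)
    n = len(pooled)
    # Midrank = average of the ranks occupied by each distinct value.
    midrank = {v: pooled.index(v) + 1 + (pooled.count(v) - 1) / 2
               for v in set(pooled)}
    rank_sums = [sum(midrank[v] for v in s) for s in samples]
    h = (12 / (n * (n + 1))
         * sum(r * r / len(s) for r, s in zip(rank_sums, samples))
         - 3 * (n + 1))
    ties = sum(pooled.count(v) ** 3 - pooled.count(v) for v in set(pooled))
    return h, h / (1 - ties / (n ** 3 - n))


h, h_adj = kruskal_wallis(groups)
print(h, h_adj)
```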


The Friedman Test 


The syntax for Friedman’s test is similar to what is used for the Kruskal-Wallis 
procedure, except that an additional column identifying the block to which each 
observation belongs must be included. As before, the data from each treatment level 
are initially put into separate columns; then those columns are stacked. For the case 
of two treatment levels, the final command would be 


MTB > friedman c3 c4 c5 




where c3 is the stacked column of the entire data set, c4 is a column identifying the 
treatment level represented by each observation, and c5 is a column giving the block 
location of each observation. 

Figure 14.A.1.4 is the Friedman analysis of the baseball data in Case 
Study 14.5.1. The observed test statistic is denoted S (instead of the g on p. 682). 


MTB > set c1
DATA > 5.50 5.70 5.60 5.50 5.85 5.55 5.40 5.50 5.15 5.80 5.20
DATA > 5.55 5.35 5.00 5.50 5.55 5.55 5.50 5.45 5.60 5.65 6.30
DATA > end
MTB > set c2
DATA > 5.55 5.75 5.50 5.40 5.70 5.60 5.35 5.35 5.00 5.70 5.10
DATA > 5.45 5.45 4.95 5.40 5.50 5.35 5.55 5.25 5.40 5.55 6.25
DATA > end
MTB > stack c1 c2 c3
MTB > set c4
DATA > 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
DATA > 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
DATA > end
MTB > set c5
DATA > 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
DATA > 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
DATA > end
MTB > friedman c3 c4 c5.

Friedman Test: C3 versus C4 blocked by C5

S = 6.55  DF = 1  P = 0.011

                     Sum of
C4    N  Est Median   Ranks
1    22      5.5500    39.0
2    22      5.4500    27.0

Grand median = 5.5000
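
The rank sums and the statistic S in the Friedman output can be verified with a short Python sketch. It ranks the observations within each block (using midranks if a block contains ties) and applies the Friedman formula, $12/[bk(k+1)] \sum r_{.j}^2 - 3b(k+1)$. The function name is illustrative, and no P-value is computed.

```python
c1 = [5.50, 5.70, 5.60, 5.50, 5.85, 5.55, 5.40, 5.50, 5.15, 5.80, 5.20,
      5.55, 5.35, 5.00, 5.50, 5.55, 5.55, 5.50, 5.45, 5.60, 5.65, 6.30]
c2 = [5.55, 5.75, 5.50, 5.40, 5.70, 5.60, 5.35, 5.35, 5.00, 5.70, 5.10,
      5.45, 5.45, 4.95, 5.40, 5.50, 5.35, 5.55, 5.25, 5.40, 5.55, 6.25]


def friedman(*treatments):
    """Friedman statistic for k treatments observed in the same b blocks."""
    k = len(treatments)
    b = len(treatments[0])
    rank_sums = [0.0] * k
    for block in zip(*treatments):
        ordered = sorted(block)
        for j, v in enumerate(block):
            # Within-block rank of v, averaged over any tied values.
            rank_sums[j] += ordered.index(v) + 1 + (ordered.count(v) - 1) / 2
    g = (12 / (b * k * (k + 1)) * sum(r * r for r in rank_sums)
         - 3 * b * (k + 1))
    return rank_sums, g


rank_sums, g = friedman(c1, c2)
print(rank_sums, round(g, 2))
```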


Appendix
STATISTICAL TABLES


A.1 Cumulative Areas under the Standard Normal Distribution
A.2 Upper Percentiles of Student t Distributions
A.3 Upper and Lower Percentiles of $\chi^2$ Distributions
A.4 Percentiles of F Distributions
A.5 Upper Percentiles of Studentized Range Distributions
A.6 Upper and Lower Percentiles of the Wilcoxon Signed Rank Statistic, W
Table A.1


Cumulative Areas under the Standard Normal Distribution 


z 0 1 2 3 4 5 6 7 8 9
-3. 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
-2.9 0.0019 0.0018 0.0017 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
-2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
-2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
-2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
-2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
-2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
-2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
-2.2 0.0139 0.0136 0.0132 0.0129 0.0126 0.0122 0.0119 0.0116 0.0113 0.0110
-2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
-2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
-1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0238 0.0233
-1.8 0.0359 0.0352 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0300 0.0294
-1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
-1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
-1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0570 0.0559
-1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0722 0.0708 0.0694 0.0681
-1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
-1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
-1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
-1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
-0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
-0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
-0.7 0.2420 0.2389 0.2358 0.2327 0.2297 0.2266 0.2236 0.2206 0.2177 0.2148
-0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
-0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
-0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
-0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
-0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
-0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
-0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641


Table A.1 Cumulative Areas under the Standard Normal Distribution (cont.)


z 0 1 2 3 4 5 6 7 8 9
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359 
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753 
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141 
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517 
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879 
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224 
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549 
0.7 0.7580 0.7611 0.7642 0.7673 0.7703 0.7734 0.7764 0.7794 0.7823 0.7852 
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133 
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389 
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621 
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830 
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015 
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9278 0.9292 0.9306 0.9319 
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9430 0.9441 
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545 
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633 
1.8 0.9641 0.9648 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9700 0.9706 
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9762 0.9767 
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817 
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857 
2.2 0.9861 0.9864 0.9868 0.9871 0.9874 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916 
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936 
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952 
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964 
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974 
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981 
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986 
3. 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990 


Source: From Samuels/Witmer, Statistics for Life Sciences, Table 3, p. 675, © 2003 Pearson Education, Inc.
Reproduced by permission of Pearson Education, Inc.
Table A.2 Upper Percentiles of Student t Distributions


[Figure: pdf of a Student t distribution with df degrees of freedom; the shaded
upper-tail area equals $\alpha$]


α
df 0.20 0.15 0.10 0.05 0.025 0.01 0.005
1 1.376 1.963 3.078 6.3138 12.706 31.821 63.657 
2 1.061 1.386 1.886 2.9200 4.3027 6.965 9.9248 
3 0.978 1.250 1.638 2.3534 3.1825 4.541 5.8409 
4 0.941 1.190 1.533 2.1318 2.7764 3.747 4.6041 
5 0.920 1.156 1.476 2.0150 2.5706 3.365 4.0321 
6 0.906 1.134 1.440 1.9432 2.4469 3.143 3.7074 
7 0.896 1.119 1.415 1.8946 2.3646 2.998 3.4995 
8 0.889 1.108 1.397 1.8595 2.3060 2.896 3.3554 
9 0.883 1.100 1.383 1.8331 2.2622 2.821 3.2498 
10 0.879 1.093 1.372 1.8125 2.2281 2.764 3.1693 
11 0.876 1.088 1.363 1.7959 2.2010 2.718 3.1058
12 0.873 1.083 1.356 1.7823 2.1788 2.681 3.0545 
13 0.870 1.079 1.350 1.7709 2.1604 2.650 3.0123 
14 0.868 1.076 1.345 1.7613 2.1448 2.624 2.9768 
15 0.866 1.074 1.341 1.7530 2.1315 2.602 2.9467 
16 0.865 1.071 1.337 1.7459 2.1199 2.583 2.9208 
17 0.863 1.069 1.333 1.7396 2.1098 2.567 2.8982 
18 0.862 1.067 1.330 1.7341 2.1009 2.552 2.8784 
19 0.861 1.066 1.328 1.7291 2.0930 2.539 2.8609 
20 0.860 1.064 1.325 1.7247 2.0860 2.528 2.8453 
21 0.859 1.063 1.323 1.7207 2.0796 2.518 2.8314 
22 0.858 1.061 1.321 1.7171 2.0739 2.508 2.8188 
23 0.858 1.060 1.319 1.7139 2.0687 2.500 2.8073 
24 0.857 1.059 1.318 1.7109 2.0639 2.492 2.7969 
25 0.856 1.058 1.316 1.7081 2.0595 2.485 2.7874 
26 0.856 1.058 1.315 1.7056 2.0555 2.479 2.7787
27 0.855 1.057 1.314 1.7033 2.0518 2.473 2.7707 
28 0.855 1.056 1.313 1.7011 2.0484 2.467 2.7633 
29 0.854 1.055 1.311 1.6991 2.0452 2.462 2.7564 
30 0.854 1.055 1.310 1.6973 2.0423 2.457 2.7500 
31 0.8535 1.0541 1.3095 1.6955 2.0395 2.453 2.7441 
32 0.8531 1.0536 1.3086 1.6939 2.0370 2.449 2.7385 
33 0.8527 1.0531 1.3078 1.6924 2.0345 2.445 2.7333 
34 0.8524 1.0526 1.3070 1.6909 2.0323 2.441 2.7284 


Table A.2 Upper Percentiles of Student t Distributions (cont.)


α

df 0.20 0.15 0.10 0.05 0.025 0.01 0.005

35 0.8521 1.0521 1.3062 1.6896 2.0301 2.438 2.7239 
36 0.8518 1.0516 1.3055 1.6883 2.0281 2.434 2.7195 
37 0.8515 1.0512 1.3049 1.6871 2.0262 2.431 2.7155 
38 0.8512 1.0508 1.3042 1.6860 2.0244 2.428 2.7116 
39 0.8510 1.0504 1.3037 1.6849 2.0227 2.426 2.7079 
40 0.8507 1.0501 1.3031 1.6839 2.0211 2.423 2.7045 
41 0.8505 1.0498 1.3026 1.6829 2.0196 2.421 2.7012 
42 0.8503 1.0494 1.3020 1.6820 2.0181 2.418 2.6981 
43 0.8501 1.0491 1.3016 1.6811 2.0167 2.416 2.6952 
44 0.8499 1.0488 1.3011 1.6802 2.0154 2.414 2.6923 
45 0.8497 1.0485 1.3007 1.6794 2.0141 2.412 2.6896 
46 0.8495 1.0483 1.3002 1.6787 2.0129 2.410 2.6870 
47 0.8494 1.0480 1.2998 1.6779 2.0118 2.408 2.6846 
48 0.8492 1.0478 1.2994 1.6772 2.0106 2.406 2.6822 
49 0.8490 1.0476 1.2991 1.6766 2.0096 2.405 2.6800 
50 0.8489 1.0473 1.2987 1.6759 2.0086 2.403 2.6778 
51 0.8487 1.0471 1.2984 1.6753 2.0077 2.402 2.6758 
52 0.8486 1.0469 1.2981 1.6747 2.0067 2.400 2.6738 
53 0.8485 1.0467 1.2978 1.6742 2.0058 2.399 2.6719 
54 0.8484 1.0465 1.2975 1.6736 2.0049 2.397 2.6700 
55 0.8483 1.0463 1.2972 1.6731 2.0041 2.396 2.6683 
56 0.8481 1.0461 1.2969 1.6725 2.0033 2.395 2.6666 
57 0.8480 1.0460 1.2967 1.6721 2.0025 2.393 2.6650 
58 0.8479 1.0458 1.2964 1.6716 2.0017 2.392 2.6633 
59 0.8478 1.0457 1.2962 1.6712 2.0010 2.391 2.6618 
60 0.8477 1.0455 1.2959 1.6707 2.0003 2.390 2.6603 
61 0.8476 1.0454 1.2957 1.6703 1.9997 2.389 2.6590 
62 0.8475 1.0452 1.2954 1.6698 1.9990 2.388 2.6576 
63 0.8474 1.0451 1.2952 1.6694 1.9984 2.387 2.6563 
64 0.8473 1.0449 1.2950 1.6690 1.9977 2.386 2.6549 
65 0.8472 1.0448 1.2948 1.6687 1.9972 2.385 2.6537 
66 0.8471 1.0447 1.2945 1.6683 1.9966 2.384 2.6525 
67 0.8471 1.0446 1.2944 1.6680 1.9961 2.383 2.6513 
68 0.8470 1.0444 1.2942 1.6676 1.9955 2.382 2.6501 
69 0.8469 1.0443 1.2940 1.6673 1.9950 2.381 2.6491 
70 0.8468 1.0442 1.2938 1.6669 1.9945 2.381 2.6480 
71 0.8468 1.0441 1.2936 1.6666 1.9940 2.380 2.6470 
72 0.8467 1.0440 1.2934 1.6663 1.9935 2.379 2.6459 
73 0.8466 1.0439 1.2933 1.6660 1.9931 2.378 2.6450 
74 0.8465 1.0438 1.2931 1.6657 1.9926 2.378 2.6440 
75 0.8465 1.0437 1.2930 1.6655 1.9922 2.377 2.6431 
76 0.8464 1.0436 1.2928 1.6652 1.9917 2.376 2.6421 
77 0.8464 1.0435 1.2927 1.6649 1.9913 2.376 2.6413 
78 0.8463 1.0434 1.2925 1.6646 1.9909 2.375 2.6406 


79 0.8463 1.0433 1.2924 1.6644 1.9905 2.374 2.6396 


Table A.2 Upper Percentiles of Student t Distributions (cont.) 
α

df 0.20 0.15 0.10 0.05 0.025 0.01 0.005 
80 0.8462 1.0432 1.2922 1.6641 1.9901 2.374 2.6388 
81 0.8461 1.0431 1.2921 1.6639 1.9897 2.373 2.6380 
82 0.8460 1.0430 1.2920 1.6637 1.9893 2.372 2.6372 
83 0.8460 1.0430 1.2919 1.6635 1.9890 2.372 2.6365 
84 0.8459 1.0429 1.2917 1.6632 1.9886 2.371 2.6357 
85 0.8459 1.0428 1.2916 1.6630 1.9883 2.371 2.6350 
86 0.8458 1.0427 1.2915 1.6628 1.9880 2.370 2.6343 
87 0.8458 1.0427 1.2914 1.6626 1.9877 2.370 2.6336 
88 0.8457 1.0426 1.2913 1.6624 1.9873 2.369 2.6329 
89 0.8457 1.0426 1.2912 1.6622 1.9870 2.369 2.6323 
90 0.8457 1.0425 1.2910 1.6620 1.9867 2.368 2.6316 
91 0.8457 1.0424 1.2909 1.6618 1.9864 2.368 2.6310 
92 0.8456 1.0423 1.2908 1.6616 1.9861 2.367 2.6303 
93 0.8456 1.0423 1.2907 1.6614 1.9859 2.367 2.6298 
94 0.8455 1.0422 1.2906 1.6612 1.9856 2.366 2.6292 
95 0.8455 1.0422 1.2905 1.6611 1.9853 2.366 2.6286 
96 0.8454 1.0421 1.2904 1.6609 1.9850 2.366 2.6280 
97 0.8454 1.0421 1.2904 1.6608 1.9848 2.365 2.6275 
98 0.8453 1.0420 1.2903 1.6606 1.9845 2.365 2.6270 
99 0.8453 1.0419 1.2902 1.6604 1.9843 2.364 2.6265 

100 0.8452 1.0418 1.2901 1.6602 1.9840 2.364 2.6260 
∞ 0.84 1.04 1.28 1.64 1.96 2.33 2.58 


Source: Scientific Tables, 6th ed. (Basel, Switzerland: J.R. Geigy, 1962), pp. 32-33. 
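
The entries of Table A.2 can be spot-checked numerically. The following is a minimal sketch, not the method used to build the table: it assumes the tabled value is the point t with P(T ≥ t) = α, integrates the t density with Simpson's rule, and solves for t by bisection. The function names are illustrative only.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(x, df, steps=2000):
    """P(T >= x) for x >= 0, using Simpson's rule on [0, x] (steps is even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_percentile(alpha, df):
    """Return t such that P(T >= t) = alpha, for 0 < alpha < 0.5."""
    lo, hi = 0.0, 1.0
    while t_upper_tail(hi, df) > alpha:  # widen the bracket until it holds t
        hi *= 2
    for _ in range(60):  # bisection on the upper-tail area
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Compare with the df = 10, alpha = 0.025 entry of the table.
print(round(t_percentile(0.025, 10), 4))
```

The numerical integration is adequate for the tail areas tabulated here; for very small α a library routine would be a safer choice.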
Table A.3 Upper and Lower Percentiles of χ² Distributions 


χ² distribution with df degrees of freedom 


Area = 1 − p 


p 
df 0.010 0.025 0.050 0.10 0.90 0.95 0.975 0.99 
1 0.000157 0.000982 0.00393 0.0158 2.706 3.841 5.024 6.635 
2 0.0201 0.0506 0.103 0.211 4.605 5.991 7.378 9.210 
3 0.115 0.216 0.352 0.584 6.251 7.815 9.348 11.345 
4 0.297 0.484 0.711 1.064 7.779 9.488 11.143 13.277 
5 0.554 0.831 1.145 1.610 9.236 11.070 12.832 15.086 
6 0.872 1.237 1.635 2.204 10.645 12.592 14.449 16.812 
7 1.239 1.690 2.167 2.833 12.017 14.067 16.013 18.475 
8 1.646 2.180 2.733 3.490 13.362 15.507 17.535 20.090 
9 2.088 2.700 3.325 4.168 14.684 16.919 19.023 21.666 
10 2.558 3.247 3.940 4.865 15.987 18.307 20.483 23.209 
11 3.053 3.816 4.575 5.578 17.275 19.675 21.920 24.725 
12 3.571 4.404 5.226 6.304 18.549 21.026 23.336 26.217 
13 4.107 5.009 5.892 7.042 19.812 22.362 24.736 27.688 
14 4.660 5.629 6.571 7.790 21.064 23.685 26.119 29.141 
15 5.229 6.262 7.261 8.547 22.307 24.996 27.488 30.578 
16 5.812 6.908 7.962 9.312 23.542 26.296 28.845 32.000 
17 6.408 7.564 8.672 10.085 24.769 27.587 30.191 33.409 
18 7.015 8.231 9.390 10.865 25.989 28.869 31.526 34.805 
19 7.633 8.907 10.117 11.651 27.204 30.144 32.852 36.191 
20 8.260 9.591 10.851 12.443 28.412 31.410 34.170 37.566 
21 8.897 10.283 11.591 13.240 29.615 32.671 35.479 38.932 
22 9.542 10.982 12.338 14.041 30.813 33.924 36.781 40.289 
23 10.196 11.688 13.091 14.848 32.007 35.172 38.076 41.638 
24 10.856 12.401 13.848 15.659 33.196 36.415 39.364 42.980 
25 11.524 13.120 14.611 16.473 34.382 37.652 40.646 44.314 
26 12.198 13.844 15.379 17.292 35.563 38.885 41.923 45.642 
27 12.879 14.573 16.151 18.114 36.741 40.113 43.194 46.963 
28 13.565 15.308 16.928 18.939 37.916 41.337 44.461 48.278 
29 14.256 16.047 17.708 19.768 39.087 42.557 45.722 49.588 
30 14.953 16.791 18.493 20.599 40.256 43.773 46.979 50.892 
31 15.655 17.539 19.281 21.434 41.422 44.985 48.232 52.191 
32 16.362 18.291 20.072 22.271 42.585 46.194 49.480 53.486 
33 17.073 19.047 20.867 23.110 43.745 47.400 50.725 54.776 
34 17.789 19.806 21.664 23.952 44.903 48.602 51.966 56.061 
Table A.3 Upper and Lower Percentiles of χ² Distributions (cont.) 


p 
df 0.010 0.025 0.050 0.10 0.90 0.95 0.975 0.99 
35 18.509 20.569 22.465 24.797 46.059 49.802 53.203 57.342 
36 19.233 21.336 23.269 25.643 47.212 50.998 54.437 58.619 
37 19.960 22.106 24.075 26.492 48.363 52.192 55.668 59.892 
38 20.691 22.878 24.884 27.343 49.513 53.384 56.895 61.162 
39 21.426 23.654 25.695 28.196 50.660 54.572 58.120 62.428 
40 22.164 24.433 26.509 29.051 51.805 55.758 59.342 63.691 
41 22.906 25.215 27.326 29.907 52.949 56.942 60.561 64.950 
42 23.650 25.999 28.144 30.765 54.090 58.124 61.777 66.206 
43 24.398 26.785 28.965 31.625 55.230 59.304 62.990 67.459 
44 25.148 27.575 29.787 32.487 56.369 60.481 64.201 68.709 
45 25.901 28.366 30.612 33.350 57.505 61.656 65.410 69.957 
46 26.657 29.160 31.439 34.215 58.641 62.830 66.617 71.201 
47 27.416 29.956 32.268 35.081 59.774 64.001 67.821 72.443 
48 28.177 30.755 33.098 35.949 60.907 65.171 69.023 73.683 
49 28.941 31.555 33.930 36.818 62.038 66.339 70.222 74.919 
50 29.707 32.357 34.764 37.689 63.167 67.505 71.420 76.154 


Source: Scientific Tables, 6th ed. (Basel, Switzerland: J.R. Geigy, 1962), p. 36. 
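
A rough numerical check of Table A.3 is sketched below. It assumes p is the area to the left of the tabled value (the entries increase with p, which is consistent with that reading), evaluates the χ² cdf through the standard series for the regularized incomplete gamma function, and inverts it by bisection. The function names are hypothetical.

```python
import math


def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable with df degrees of freedom,
    from the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = total = 1.0
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)  # z^n / ((a+1)(a+2)...(a+n))
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a + 1))


def chi2_percentile(p, df):
    """Return x such that P(X <= x) = p, for 0 < p < 1."""
    lo, hi = 0.0, float(df) + 10.0
    while chi2_cdf(hi, df) < p:  # widen the bracket if necessary
        hi *= 2
    for _ in range(80):  # bisection on the cdf
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Compare with the df = 10 row of the table.
print(round(chi2_percentile(0.05, 10), 3), round(chi2_percentile(0.99, 10), 3))
```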


F distribution with m 
and n degrees of freedom 


Area=1-p 


0 F_{p,m,n} 


The figure above illustrates the percentiles of the F distributions shown in Table 
A.4. Table A.4 is used with permission from Wilfrid J. Dixon and Frank J. Massey, 
Jr., Introduction to Statistical Analysis, 2nd ed. (New York: McGraw-Hill, 1957), 
pp. 389-404. 
Table A.4 Percentiles of F Distributions 

.95    3.94  3.87  3.84  3.81  3.77  3.75  3.74  3.71  3.70  3.69  3.68  3.67 
.975   5.27  5.17  5.12  5.07  5.01  4.98  4.96  4.92  4.90  4.88  4.86  4.85 
.99    7.56  7.40  7.31  7.23  7.14  7.09  7.06  6.99  6.97  6.93  6.90  6.88 
.995   9.81  9.59  9.47  9.36  9.24  9.17  9.12  9.03  9.00  8.95  8.91  8.88 
.999   17.6  17.1  16.9  16.7  16.4  16.3  16.2  16.0  16.0  15.9  15.8  15.7 
.9995  22.4  21.9  21.7  21.4  21.1  20.9  20.7  20.5  20.4  20.3  20.2  20.1 

Table A.4 Percentiles of F Distributions (cont.) 

.50    1.01  1.02  1.03  1.03  1.04  1.04  1.05  1.05  1.05  1.05  1.06  1.06 
.75    1.48  1.47  1.46  1.45  1.45  1.44  1.44  1.43  1.43  1.43  1.42  1.42 
.90    2.11  2.06  2.04  2.01  1.99  1.97  1.96  1.94  1.93  1.92  1.91  1.90 
.95    2.62  2.54  2.51  2.47  2.43  2.40  2.38  2.35  2.34  2.32  2.31  2.30 
.975   3.18  3.07  3.02  2.96  2.91  2.87  2.85  2.80  2.79  2.76  2.74  2.72 
.99    4.01  3.86  3.78  3.70  3.62  3.57  3.54  3.47  3.45  3.41  3.38  3.36 
.995   4.72  4.53  4.43  4.33  4.23  4.17  4.12  4.04  4.01  3.97  3.93  3.90 
.999   6.71  6.40  6.25  6.09  5.93  5.83  5.76  5.63  5.59  5.52  5.46  5.42 
.9995  7.74  7.37  7.18  7.    6.80  6.68  6.61  6.45  6.41  6.33  6.25  6.20 

Table A.4 Percentiles of F Distributions (cont.) 


p 


m 


is 


/ 


| 20 | 24 


.197/.220 
.219}.242 
- 286) 308 
324) .346 
-389).410 
454) 474 


542). 561 
. 728}. 742 
-O1)1.02 
-41/1.41 
.92}1.90 


.33/2.39 


-88)3.79 
. 25/5 .10 
.93)5.75 
-211). 

. 233}. 
301). 


1 
1 
1 
2 
2 
3.37/3.29 
3 
5 
5 


He Bs 2 


30 


244 
266 
333 
.370 
433 
496 


581 


40 50 | 


. 272) .290 
. 294) .< 
. 360). 
. 397}. 


458). 
519). 


602). 


. 757 
1.02) 
1.40 
1.87 


2.25 


3. 
3. 
4. 
5. 


SOWNNHE See 


772). 782 
1.03}1.03 
1.39 
1.85 


2.20 
2.59 


Nace Be 
CN@CO =) 


‘354! . 


 n   p       m = 60    100    120    200    500     ∞ 
15   .0005     .303   .330   .339   .353   .368   .377 
     .001      .325   .352   .360   .375   .388   .398 
     .005      .389   .415   .422   .435   .448   .457 
     .01       .425   .450   .456   .469   .483   .490 
     .025      .485   .508   .514   .526   .538   .546 
     .05       .545   .565   .571   .581   .592   .600 
     .10       .624   .641   .647   .658   .667   .672 
     .25       .788   .802   .805   .812   .818   .822 
     .50       1.03   1.04   1.04   1.04   1.04   1.05 
     .75       1.38   1.38   1.37   1.37   1.36   1.36 
     .90       1.82   1.79   1.79   1.77   1.76   1.76 
     .95       2.16   2.12   2.11   2.10   2.08 
     .975      2.52   2.47   2.46   2.44   2.41 
     .99       3.05   2.98   2.96   2.92   2.89 
     .995      3.48   3.39   3.37   3.33   3.29 
     .999      4.64   4.51   4.47   4.41   4.35 
     .9995     5.21   5.06   5.02   4.94   4.87 
20   .0005     .331   .364   .375   .391   .408   .422 
     .001      .354   .386   .395   .413   .429   .441 
     .005      .419   .448   .457   .474   .490 
     .01       .455   .483   .491   .508   .521 
     .025      .514   .541   .548   .562   .575 
     .05       .572   .595   .603   .617   .629 
     .10       .648   .671   .675   .685   .694 
     .25       .801   .816   .820   .827   .835   .840 
     .50       1.02   1.03   1.03   1.03   1.03   1.03 
     .75       1.32   1.31   1.31   1.30   1.30   1.29 
     .90       1.68   1.65   1.64   1.63   1.62   1.61 
     .95       1.95   1.91   1.90   1.88   1.86   1.84 
     .975      2.22   2.17   2.16   2.13   2.10   2.09 
     .99       2.61   2.54   2.52   2.48   2.44   2.42 
     .995      2.92   2.83   2.81   2.76   2.72   2.69 
     .999      3.70   3.58   3.54   3.48   3.42   3.38 
     .9995     4.07   3.93   3.90   3.82   3.75   3.70 
24   .0005     .349   .384   .395   .416   .434 
     .001      .371   .405   .417   .437   .455   .469 
     .005      .437   .469   .479   .498   .515 
     .01       .473   .505   .513   .529   .546 
     .025      .531   .562   .568   .585   .599 
     .05       .588   .613   .622   .637   .649 
     .10       .662   .685   .691   .704   .715   .723 
     .25       .809   .825   .829   .837   .844   .850 
     .50       1.02   1.02   1.02   1.02   1.03   1.03 
     .75       1.29   1.28   1.28   1.27   1.27   1.26 
     .90       1.61   1.58   1.57   1.56   1.54   1.53 
     .95       1.84   1.80   1.79   1.77   1.75   1.73 
     .975      2.08   2.02   2.01   1.98   1.95   1.94 
     .99       2.40   2.33   2.31   2.27   2.24   2.21 
     .995      2.66   2.57   2.55   2.50   2.46   2.43 
     .999      3.29   3.16   3.14   3.07   3.01   2.97 
3.5 +4433.41/3.33)3.273 .9995 


Table A.4 Percentiles of F Distributions (cont.) 


p      m = 15    20    24    30    40    50    60   100   120   200   500     ∞ 
.01      .311  .360  .388  .419  .454  .476  .493  .529  .538  .559  .575  .590 
.025     .378  .426  .453  .482  .515  .535  .551  .585  .592  .610  .625  .639 
.05      .445  .490  .516  .543  .573  .592  .606  .637  .644  .658  .676  .685 
.10      .534  .575  .598  .623  .649  .667  .678  .704  .710  .725  .735  .746 
.25      .716  .746  .763  .780  .798  .810  .818  .835  .839  .848  .856  .862 
.50      .978  .989  .994  1.00  1.01  1.01  1.01  1.02  1.02  1.02  1.02  1.02 
.75      1.32  1.30  1.29  1.28  1.27  1.26  1.26  1.25  1.24  1.24  1.23  1.23 
.90      1.72  1.67  1.64  1.61  1.57  1.55  1.54  1.51  1.50  1.48  1.47  1.46 
.95      2.01  1.93  1.89  1.84  1.79  1.76  1.74  1.70  1.68  1.66  1.64  1.62 
.975     2.31  2.20  2.14  2.07  2.01  1.97  1.94  1.88  1.87  1.84  1.81  1.79 
.99      2.70  2.55  2.47  2.39  2.30  2.25  2.21  2.13  2.11  2.07  2.03  2.03 
.995     3.01  2.82  2.73  2.63  2.52  2.46  2.42  2.32  2.30  2.25  2.21  2.18 
.999     3.75  3.49  3.36  3.22  3.07  2.98  2.92  2.79  2.76  2.69  2.63  2.59 

.50      .972  .983  .989  .994  1.00  1.00  1.01  1.01  1.01  1.01  1.02  1.02 
.75      1.30  1.28  1.26  1.25  1.24  1.23  1.22  1.21  1.21  1.20  1.19  1.19 
.90      1.66  1.61  1.57  1.54  1.51  1.48  1.47  1.43  1.42  1.41  1.39  1.38 
Table A.5 Upper Percentiles of Studentized Range Distributions 


Studentized range distribution with 
k and v degrees of freedom 


 v   α    k = 2    3     4     5     6     7     8     9    10    11    12    13    14    15    16 
 1  .05   18.0  27.0  32.8  37.1  40.4  43.1  45.4  47.4  49.1  50.6  52.0  53.2  54.3  55.4  56.3 
    .01   90.0  135   164   186   202   216   227   237   246   253   260   266   272   277   282 
 2  .05   6.09  8.3   9.8   10.9  11.7  12.4  13.0  13.5  14.0  14.4  14.7  15.1  15.4  15.7  15.9 
    .01   14.0  19.0  22.3  24.7  26.6  28.2  29.5  30.7  31.7  32.6  33.4  34.1  34.8  35.4  36.0 
 3  .05   4.50  5.91  6.82  7.50  8.04  8.48  8.85  9.18  9.46  9.72  9.95  10.2  10.4  10.5  10.7 
    .01   8.26  10.6  12.2  13.3  14.2  15.0  15.6  16.2  16.7  17.1  17.5  17.9  18.2  18.5  18.8 
 4  .05   3.93  5.04  5.76  6.29  6.71  7.05  7.35  7.60  7.83  8.03  8.21  8.37  8.52  8.66  8.79 
    .01   6.51  8.12  9.17  9.96  10.6  11.1  11.5  11.9  12.3  12.6  12.8  13.1  13.3  13.5  13.7 
 5  .05   3.64  4.60  5.22  5.67  6.03  6.33  6.58  6.80  6.99  7.17  7.32  7.47  7.60  7.72  7.83 
    .01   5.70  6.97  7.80  8.42  8.91  9.32  9.67  9.97  10.2  10.5  10.7  10.9  11.1  11.2  11.4 
 6  .05   3.46  4.34  4.90  5.31  5.63  5.89  6.12  6.32  6.49  6.65  6.79  6.92  7.03  7.14  7.24 
    .01   5.24  6.33  7.03  7.56  7.97  8.32  8.61  8.87  9.10  9.30  9.49  9.65  9.81  9.95  10.1 
 7  .05   3.34  4.16  4.68  5.06  5.36  5.61  5.82  6.00  6.16  6.30  6.43  6.55  6.66  6.76  6.85 
    .01   4.95  5.92  6.54  7.01  7.37  7.68  7.94  8.17  8.37  8.55  8.71  8.86  9.00  9.12  9.24 
 8  .05   3.26  4.04  4.53  4.89  5.17  5.40  5.60  5.77  5.92  6.05  6.18  6.29  6.39  6.48  6.57 
    .01   4.74  5.63  6.20  6.63  6.96  7.24  7.47  7.68  7.87  8.03  8.18  8.31  8.44  8.55  8.66 
 9  .05   3.20  3.95  4.42  4.76  5.02  5.24  5.43  5.60  5.74  5.87  5.98  6.09  6.19  6.28  6.36 
    .01   4.60  5.43  5.96  6.35  6.66  6.91  7.13  7.32  7.49  7.65  7.78  7.91  8.03  8.13  8.23 
10  .05   3.15  3.88  4.33  4.65  4.91  5.12  5.30  5.46  5.60  5.72  5.83  5.93  6.03  6.11  6.20 
    .01   4.48  5.27  5.77  6.14  6.43  6.67  6.87  7.05  7.21  7.36  7.48  7.60  7.71  7.81  7.91 
11  .05   3.11  3.82  4.26  4.57  4.82  5.03  5.20  5.35  5.49  5.61  5.71  5.81  5.90  5.99  6.06 
    .01   4.39  5.14  5.62  5.97  6.25  6.48  6.67  6.84  6.99  7.13  7.25  7.36  7.46  7.56  7.65 
12  .05   3.08  3.77  4.20  4.51  4.75  4.95  5.12  5.27  5.40  5.51  5.62  5.71  5.80  5.88  5.95 
    .01   4.32  5.04  5.50  5.84  6.10  6.32  6.51  6.67  6.81  6.94  7.06  7.17  7.26  7.36  7.44 
13  .05   3.06  3.73  4.15  4.45  4.69  4.88  5.05  5.19  5.32  5.43  5.53  5.63  5.71  5.79  5.86 
    .01   4.26  4.96  5.40  5.73  5.98  6.19  6.37  6.53  6.67  6.79  6.90  7.01  7.10  7.19  7.27 
14  .05   3.03  3.70  4.11  4.41  4.64  4.83  4.99  5.13  5.25  5.36  5.46  5.55  5.64  5.72  5.79 
    .01   4.21  4.89  5.32  5.63  5.88  6.08  6.26  6.41  6.54  6.66  6.77  6.87  6.96  7.05  7.12 
Source: Olive Jean Dunn and Virginia A. Clark, Applied Statistics: Analysis of Variance and Regression (New York: 
Wiley, 1974), pp. 371-372. Reproduced with permission of John Wiley & Sons, Inc. 
Table A.6 Upper and Lower Percentiles of the Wilcoxon Signed 
Rank Statistic, W 


w1* w2* P(W ≤ w1*) = P(W ≥ w2*) 
n=4 0 10 0.062 
1 9 0.125 
n=5 0 15 0.031 
1 14 0.062 
2 13 0.094 
3 12 0.156 
n=6 0 21 0.016 
1 20 0.031 
2 19 0.047 
3 18 0.078 
4 17 0.109 
5 16 0.156 
n=7 0 28 0.008 
1 27 0.016 
2 26 0.023 
3 25 0.039 
4 24 0.055 
5 23 0.078 
6 22 0.109 
7 21 0.148 
n=8 0 36 0.004 
1 35 0.008 
2 34 0.012 
3 33 0.020 
4 32 0.027 
5 31 0.039 
6 30 0.055 
7 29 0.074 
8 28 0.098 
9 27 0.125 
n=9 1 44 0.004 
2 43 0.006 
3 42 0.010 
4 41 0.014 
5 40 0.020 
6 39 0.027 
7 38 0.037 
8 37 0.049 
9 36 0.064 
10 35 0.082 
11 34 0.102 
Table A.6 Upper and Lower Percentiles of the Wilcoxon Signed 
Rank Statistic, W (cont.) 


w1* w2* P(W ≤ w1*) = P(W ≥ w2*) 
n=10 3 52 0.005 
4 51 0.007 
5 50 0.010 
6 49 0.014 
7 48 0.019 
8 47 0.024 
9 46 0.032 
10 45 0.042 
11 44 0.053 
12 43 0.065 
13 42 0.080 
14 41 0.097 
15 40 0.116 
16 39 0.138 
n=11 5 61 0.005 
6 60 0.007 
7 59 0.009 
8 58 0.012 
9 57 0.016 
10 56 0.021 
11 55 0.027 
12 54 0.034 
13 53 0.042 
14 52 0.051 
15 51 0.062 
16 50 0.074 
17 49 0.087 
18 48 0.103 
19 47 0.120 
20 46 0.139 
n=12 7 71 0.005 
8 70 0.006 
9 69 0.008 
10 68 0.010 
11 67 0.013 
12 66 0.017 
13 65 0.021 
14 64 0.026 
15 63 0.032 
16 62 0.039 
17 61 0.046 
18 60 0.055 
19 59 0.065 
20 58 0.076 
21 57 0.088 
22 56 0.102 
23 55 0.117 
24 54 0.133 


Source: Used with permission from Wilfrid J. Dixon and Frank 
J. Massey, Jr., Introduction to Statistical Analysis, 2nd ed. (New York: 
McGraw-Hill, 1957), pp. 443-444. 
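
The probabilities in Table A.6 can be reproduced exactly by counting. The sketch below assumes the usual null model, in which each of the ranks 1, ..., n enters the signed rank sum W independently with probability 1/2, so that P(W = w) is the number of subsets of {1, ..., n} summing to w, divided by 2^n. The function names are illustrative only.

```python
from fractions import Fraction


def signed_rank_counts(n):
    """counts[w] = number of subsets of {1, ..., n} whose elements sum to w."""
    max_w = n * (n + 1) // 2
    counts = [0] * (max_w + 1)
    counts[0] = 1  # the empty subset
    for rank in range(1, n + 1):
        # Go downward so that each rank is used at most once per subset.
        for w in range(max_w, rank - 1, -1):
            counts[w] += counts[w - rank]
    return counts


def lower_tail(n, w):
    """P(W <= w) under the null model described above."""
    counts = signed_rank_counts(n)
    return Fraction(sum(counts[: w + 1]), 2 ** n)


# Compare with the n = 10, w1* = 8 row of the table.
print(float(lower_tail(10, 8)))
```

Because the null distribution of W is symmetric about n(n + 1)/4, the same value serves as the upper-tail probability at n(n + 1)/2 − w, which is why the table lists the two cutoffs side by side.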
ANSWERS TO SELECTED ODD-NUMBERED QUESTIONS 


CHAPTER 2 
Section 2.2 


2.2.1. $S = \{(s,s,s), (s,s,f), (s,f,s), (f,s,s), (s,f,f), (f,s,f), (f,f,s), (f,f,f)\}$ 
$A = \{(s,f,s), (f,s,s)\}$; $B = \{(f,f,f)\}$ 
2.2.3. (1,3, 4), (1, 3,5), (1, 3, 6), (2, 3, 4), (2, 3, 5), (2, 3, 6) 
2.2.5. The outcome sought is (4, 4). 
2.2.7. $P$ = {right triangles with sides $(5, a, b)$: $a^2 + b^2 = 25$} 
2.2.9. (a) $S = \{(0,0,0,0), (0,0,0,1), (0,0,1,0), (0,0,1,1), (0,1,0,0), (0,1,0,1), (0,1,1,0), (0,1,1,1), (1,0,0,0), (1,0,0,1), (1,0,1,0), (1,0,1,1), (1,1,0,0), (1,1,0,1), (1,1,1,0), (1,1,1,1)\}$ 

(b) $A = \{(0,0,1,1), (0,1,0,1), (0,1,1,0), (1,0,0,1), (1,0,1,0), (1,1,0,0)\}$ 

(c) $1 + k$ 


2.2.11. Let $p_1$ and $p_2$ denote the two perpetrators and $i_1$, $i_2$, 
and $i_3$, the three in the lineup who are innocent. Then $S = 
\{(p_1,i_1), (p_1,i_2), (p_1,i_3), (p_2,i_1), (p_2,i_2), (p_2,i_3), (p_1,p_2), 
(i_1,i_2), (i_1,i_3), (i_2,i_3)\}$. The event $A$ contains every outcome 
in $S$ except $(p_1, p_2)$. 

2.2.13. In order for the shooter to win with a point of 9, one 
of the following (countably infinite) sequences of sums must 
be rolled: (9,9), (9, no 7 or no 9,9), (9, no 7 or no 9, no 7 or 
no 9,9),.... 

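2.2.15. Let $A_k$ be the set of chips placed in the urn at $1/2^k$ 
minute until midnight. For example, $A_1 = \{11, 12, \ldots, 20\}$. 
Then the set of chips in the urn is $\bigcup_{k=1}^{\infty} (A_k - \{k\}) = \bigcup_{k=1}^{\infty} A_k - \bigcup_{k=1}^{\infty} \{k\} = \emptyset$, 
since $\bigcup_{k=1}^{\infty} A_k$ is a subset of $\bigcup_{k=1}^{\infty} \{k\}$. 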


2.2.17. $A \cap B = \{x: -3 < x < 2\}$ and $A \cup B = \{x: -4 < x < 2\}$ 
2.2.19. A => (Ay, ia) Ay) U (Ay a An) 
2.2.21. 40 


2.2.23. (a) If $s$ is a member of $A \cup (B \cap C)$, then $s$ belongs 
to $A$ or to $B \cap C$. If it is a member of $A$ or of $B \cap C$, then 
it belongs to $A \cup B$ and to $A \cup C$. Thus, it is a member of 
$(A \cup B) \cap (A \cup C)$. Conversely, choose $s$ in $(A \cup B) \cap (A \cup C)$. 
If it belongs to $A$, then it belongs to $A \cup (B \cap C)$. If it does not 
belong to $A$, then it must be a member of $B \cap C$. In that case 
it also is a member of $A \cup (B \cap C)$. 

2.2.25. (a) Let $s$ be a member of $A \cup (B \cup C)$. Then $s$ belongs 
to either $A$ or $B \cup C$ (or both). If $s$ belongs to $A$, it necessarily 
belongs to $(A \cup B) \cup C$. If $s$ belongs to $B \cup C$, it belongs to $B$ 
or $C$ or both, so it must belong to $(A \cup B) \cup C$. Now, suppose 
$s$ belongs to $(A \cup B) \cup C$. Then it belongs to either $A \cup B$ or $C$ 
or both. If it belongs to $C$, it must belong to $A \cup (B \cup C)$. If it 
belongs to $A \cup B$, it must belong to either $A$ or $B$ or both, so 
it must belong to $A \cup (B \cup C)$. 

(b) The proof is similar to part (a). 

2.2.27. A is a subset of B 


2.2.29. (a) $B$ and $C$ 
(b) $B$ is a subset of $A$ 


123 


2.2.31. 240 

2.2.35. $A$ and $B$ are subsets of $A \cup B$. 
2.2.37. 100/1200 

2.2.39. 500 


Section 2.3 


2.3.1. 0.41 

2.3.3. (a) $1 - P(A \cap B)$ 

(b) $P(B) - P(A \cap B)$ 

2.3.5. No. $P(A_1 \cup A_2 \cup A_3) = P(\text{at least one "6" appears}) = 1 - P(\text{no 6's appear}) = 1 - (5/6)^3 \ne 1/2$. The $A_i$'s are not mutually exclusive, so $P(A_1 \cup A_2 \cup A_3) \ne P(A_1) + P(A_2) + P(A_3)$. 

2.3.7. By inspection, $B = (B \cap A_1) \cup (B \cap A_2) \cup \cdots \cup (B \cap A_n)$. 
3 


2.3.9. 


2.3.11. 0.30 
2.3.13. 0.15 

2.3.15. (a) $X^C \cap Y = \{(H,T,T,H), (T,H,H,T)\}$, so $P(X^C \cap Y) = 2/16$ 

(b) $X \cap Y^C = \{(H,T,T,T), (T,T,T,H), (T,H,H,H), (H,H,H,T)\}$, so $P(X \cap Y^C) = 4/16$ 

2.3.17. $A \cap B$, $(A \cap B) \cup (A \cap C)$, $A$, $A \cup B$, $S$ 


Section 2.4 

2.4.1. 3/10 

2.4.3. If $P(A|B) = \frac{P(A \cap B)}{P(B)} < P(A)$, then $P(A \cap B) < P(A) \cdot P(B)$. It follows that $P(B|A) = \frac{P(A \cap B)}{P(A)} < \frac{P(A) \cdot P(B)}{P(A)} = P(B)$. 
2.4.5. The answer would remain the same. Distinguishing 
only three family types does not make them equally likely; 
(girl, boy) families will occur twice as often as either (boy, 
boy) or (girl, girl) families. 

2.4.7. 3/8 

2.4.9. 5/6 

2.4.11. (a) 5/100 (b) 70/100 (c) 95/100 (d) 75/100 
(e) 70/95 (f) 25/95 (g) 30/35 
2.4.13. 3/5 
2.4.15. 1/5 
2.4.17. 2/3 
2.4.19. 20/55 
2.4.21. 1800/360,360; 1/360,360 
2.4.23. 1/6,497,400 
2.4.25. 0.027 
2.4.27. 0.23 
2.4.29. 0.70 
2.4.31. 0.02% 
2.4.33. 0.645 
2.4.35. No. Let $B$ denote the event that the person calling 
the toss is correct. Let $A_H$ be the event that the coin comes up 
Heads and let $A_T$ be the event that the coin comes up Tails. 
Then $P(B) = P(B|A_H)P(A_H) + P(B|A_T)P(A_T) = (0.7)\left(\frac{1}{2}\right) + (0.3)\left(\frac{1}{2}\right) = \frac{1}{2}$. 
2.4.37. 0.415 
2.4.39. 0.46 
2.4.41. 5/12 
2.4.43. Hearthstone 
2.4.45. 0.74 
2.4.47. 14 
2.4.49. 0.441 
2.4.51. 0.64 
2.4.53. 1/3 


Section 2.5 
2.5.1. (a) No, $P(A \cap B) > 0$ (b) No, $P(A \cap B) = 0.2 \ne 0.3 = P(A)P(B)$ (c) 0.8 
2.5.3. 6/36 
2.5.5. 0.51 
2.5.7. (a) (1) 3/8 (2) 11/32 (b) (1) 0 (2) 1/4 
2.5.9. 6/16 


2.5.11. Equation 2.5.3: 
$P(A \cap B \cap C) = P(\{(1, 3)\}) = 1/36 = (2/6)(3/6)(6/36) = P(A)P(B)P(C)$ 

Equation 2.5.4: 
$P(B \cap C) = P(\{(1, 3), (5, 6)\}) = 2/36 \ne (3/6)(6/36) = P(B)P(C)$ 


2.5.13. 11 
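2.5.15. $P(A \cap B \cap C) = 0$ (since the sum of two odd numbers 
is necessarily even) $\ne P(A) \cdot P(B) \cdot P(C) > 0$, so $A$, $B$, and 
$C$ are not mutually independent. However, $P(A \cap B) = \frac{9}{36} = P(A) \cdot P(B) = \frac{3}{6} \cdot \frac{3}{6}$, 
$P(A \cap C) = \frac{9}{36} = P(A) \cdot P(C) = \frac{3}{6} \cdot \frac{18}{36}$, 
and $P(B \cap C) = \frac{9}{36} = P(B) \cdot P(C) = \frac{3}{6} \cdot \frac{18}{36}$, so $A$, $B$, and $C$ are 
pairwise independent. 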
2.5.17. 0.56 
2.5.19. Let $p$ be the probability of having a winning game 
card. 
Then 0.32 = P(winning at least once in 5 tries) 
= 1 − P(not winning in 5 tries) 
= $1 - (1 - p)^5$, so $p = 0.074$ 
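
The last step of 2.5.19, solving $0.32 = 1 - (1 - p)^5$ for $p$, can be made explicit. The short sketch below inverts the relation directly; the function name is hypothetical, and the only inputs are the two numbers quoted in the answer.

```python
def per_try_probability(target, tries):
    """Solve target = 1 - (1 - p)**tries for p."""
    return 1 - (1 - target) ** (1 / tries)


p = per_try_probability(0.32, 5)
print(round(p, 3))

# Sanity check: plugging p back in recovers the chance of at least one win.
print(round(1 - (1 - p) ** 5, 2))
```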


2.5.21. 7 
2.5.23. 63/384 
2.5.25. 25 


2.5.27. $\frac{w}{w + r}$ 
2.5.29. 12 


Section 2.6 


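2.6.1. $2 \cdot 3 \cdot 2 \cdot 2 = 24$ 
2.6.3. $3 \cdot 3 \cdot 5 = 45$; included are aeu and cdx 
2.6.5. $9 \cdot 9 \cdot 8 = 648$; $8 \cdot 8 \cdot 5 = 320$ 
2.6.7. $5 \cdot 2^7 = 640$ 
2.6.9. $4 \cdot 14 \cdot 6 + 4 \cdot 6 \cdot 5 + 14 \cdot 6 \cdot 5 + 4 \cdot 14 \cdot 5 = 1156$ 
2.6.11. $2^8 - 1 = 255$; five families can be added 
2.6.13. $2^8 - 1 = 255$ 
2.6.15. $12 \cdot 4 + 1 \cdot 3 = 51$ 
2.6.17. $6 \cdot 5 \cdot 4 = 120$ 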

2.6.19. 2.645 x 10° 

2.6.21. $2 \cdot 6 \cdot 5 = 60$ 
2.6.23. $4 \cdot {}_{10}P_3 = 2880$ 
2.6.25. $6! - 1 = 719$ 
2.6.27. $(2!)(8!)(6) = 483{,}840$ 
2.6.29. $(13!)^4$ 
2.6.31. $9(8)4 = 288$ 
2.6.33. (a) $(4!)(5!) = 2880$ (b) $6(4!)(5!) = 17{,}280$ 
(c) $(4!)(5!) = 2880$ (d) $2(9)8(7)6(5) = 30{,}240$ 


2.6.35. es OE 180 
"3113 ‘3 DN)? 


2.6.37. (a) 4!3!3!=864 (b) 3!4!3!3!=5184 (c) 10!=3,628,800 
(d) 10!/4!3!3! = 4200 
2.6.39. $(2n)!/[n!(2!)^n] = 1 \cdot 3 \cdot 5 \cdots (2n - 1)$ 
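
The identity in 2.6.39, read as $(2n)!/[n!\,2^n] = 1 \cdot 3 \cdot 5 \cdots (2n-1)$, is easy to spot-check for small $n$. The sketch below is only a numerical check, not a proof, and the function names are illustrative.

```python
from math import factorial, prod


def left_side(n):
    """(2n)! / (n! * 2^n), computed with exact integer arithmetic."""
    return factorial(2 * n) // (factorial(n) * 2 ** n)


def right_side(n):
    """Product of the odd integers 1, 3, 5, ..., 2n - 1."""
    return prod(range(1, 2 * n, 2))


for n in range(1, 6):
    print(n, left_side(n), right_side(n))
```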
2.6.41. 11-10!/3! = 6,652,800 
2.6.43. 4!/2!2!=6 
2.6.45. 6!/3!3! = 20 

1 14! 
* 30 2121212131111! 
2.6.49. The three courses with A grade can be: 
English, math, French English, math, psychology 
English, math, history English, French, psychology 
English, French, history English, psychology, history 
math, French, psychology math, French, history 
math, psychology, history French, psychology, history 


10\ (15 
2.6.51. ( : ) ( ; ) = 95,550 
9 5\ (4 \ (5 
2.6.53. (a) (;) =126 (b) C) (;) = 18) (;) Le @ 2 
4) < 190 
(= 
2.6.55. ey i, 2= 126 


8\ 7! 
2.6.57. ~—— = 7350 
4) 21411! 


2.6.47 


= 30,270,240 


2.6.59. Consider the problem of selecting an unordered sam- 
ple of n objects from a set of 2n objects, where the 2n have 
been divided into two groups, each of size n. Clearly, we could 
choose n from the first group and 0 from the second group, or 
n— 1 from the first group and 1 from the second group, and so 


on. at a) on (") G) + (, : , () + 
+(0)()-20 ())=( seen (2) = 
z()) 


2.6.61. The ratio of two successive terms in the sequence is 
$\binom{n}{j+1} \Big/ \binom{n}{j} = \frac{n-j}{j+1}$. For small $j$, $n - j > j + 1$, implying 
that the terms are increasing. For $j > \frac{n-1}{2}$, though, the ratio 
is less than 1, meaning the terms are decreasing. 


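2.6.63. Using Newton's binomial expansion, the equation 
$(1+t)^d \cdot (1+t)^e = (1+t)^{d+e}$ can be written 

$\left(\sum_{j=0}^{d} \binom{d}{j} t^j\right) \left(\sum_{j=0}^{e} \binom{e}{j} t^j\right) = \sum_{j=0}^{d+e} \binom{d+e}{j} t^j$ 

Since the exponent $k$ can arise as $t^0 \cdot t^k$, $t^1 \cdot t^{k-1}$, ..., $t^k \cdot t^0$, 
it follows that $\binom{d}{0}\binom{e}{k} + \binom{d}{1}\binom{e}{k-1} + \cdots + \binom{d}{k}\binom{e}{0} = \binom{d+e}{k}$. 
That is, $\binom{d+e}{k} = \sum_{j=0}^{k} \binom{d}{j}\binom{e}{k-j}$. 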
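
The identity that 2.6.63 arrives at, read here as $\binom{d+e}{k} = \sum_{j=0}^{k} \binom{d}{j}\binom{e}{k-j}$, can be checked numerically. The symbol names d, e, and k are an assumption, since the printed formula is hard to read, and the function below is a hypothetical helper rather than anything from the text.

```python
from math import comb


def convolution_sum(d, e, k):
    """Sum over j of C(d, j) * C(e, k - j); comb returns 0 when j > d or k - j > e."""
    return sum(comb(d, j) * comb(e, k - j) for j in range(k + 1))


# Compare both sides of the identity for one choice of d, e, k.
d, e, k = 6, 4, 5
print(convolution_sum(d, e, k), comb(d + e, k))
```

The sum mirrors the coefficient-matching argument: each term counts the ways of taking $t^j$ from the first factor and $t^{k-j}$ from the second.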
Section 2.7 


2.7.1. 63/210 


2.7.3. $1 - \frac{37}{190}$ 
2.7.5. 10/19 (recall Bayes's rule) 
2.7.7. $1/6^{n-1}$ 
2.7.9. $2(n!)^2/(2n)!$ 


2.7.11. $7!/7^7$; $1/7^6$. The assumption being made is that all possible 
departure patterns are equally likely, which is probably 
not true, since residents living on lower floors would be less 
inclined to wait for the elevator than would those living on 
the top floors. 


2.7.13. 21° = 
10 


k\ 365-364---(365—k+2) 
2 (365)* 


2nat. (2)/(4) 


2.7.15. ( 


vf) (OVO 
2 HO QOMO/E) 
mo (OTC 


CHAPTER 3 


Section 3.2 


3.2.1. 0.211 
3.2.3. $(0.23)^{12} = 1/45{,}600{,}000$ 
3.2.5. 0.0185 


3.2.7. The probability that a two engine plane lands safely 
is 0.84. The probability that a four engine plane lands safely 
is 0.8208. 
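
The two figures in 3.2.7 are binomial tail probabilities. The sketch below reproduces them under assumptions that are not stated in the answer itself: each engine works independently with probability 0.6, a two-engine plane needs at least one working engine, and a four-engine plane needs at least two. These values were chosen because they give back 0.84 and 0.8208.

```python
from math import comb


def prob_at_least(n, k_min, p):
    """P(X >= k_min) when X is binomial with parameters n and p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


p_engine_works = 0.6  # assumed value, not given in the answer
two_engine = prob_at_least(2, 1, p_engine_works)
four_engine = prob_at_least(4, 2, p_engine_works)
print(round(two_engine, 4), round(four_engine, 4))
```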


3.2.9. n = 6: 0.67; n = 12: 0.62; n = 18: 0.60 


3.2.11. The probability of two girls and two boys is 0.375. 
The probability of three of one sex and one of the other is 0.5. 


3.2.13. 7 
3.2.15. (1) 0.273 (2) 0.756 
3.2.17. Expanding $[p + (1 - p)]^n$ gives $1 = [p + (1 - p)]^n = \sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k}$ 
3.2.19. 0.031 
3.2.21. 64/84 
3.2.23. 0.967 
3.2.25. 0.964 
3.2.27. 0.129 
3.2.31. 53/99 


Section 3.3 
3.3.1. (a)              (b) 
k   p_X(k)           k   p_V(k) 
2   1/10             3   1/10 
3   2/10             4   1/10 
4   3/10             5   2/10 
5   4/10             6   2/10 
                     7   2/10 
                     8   1/10 
                     9   1/10 
3.3.3. 
3.3.5. 


3.3.7. 
3.3.9. 


k— 
3.3.11. poxiilk) = px ( 


k=1, 


: 
3.3.13. Fy(k) = > (;) ( 


3.3.15. See answer to Question 3.3.3. 


$p_X(k) = k^3/216 - (k - 1)^3/216$ 

$p_X(3) = 1/8$; $p_X(1) = 3/8$; $p_X(-1) = 3/8$; $p_X(-3) = 1/8$ 
4 
k 


px(2k—4)=( — 
k Px(k) 
0 4/10 
1 3/10 
2 2/10 
3 1/10 


SAD 


Section 3.4 


3.4.1. 
3.4.3. 
3.4.5. 


3.4.7. 


3.4.9. 


3.4.11 


1/16 
13/64 


(a) 0.135 (b) 0.23355 
Fy(yy=y* PY <1/2) 
Fy(y)= 


. (a) 0.693 (b) 0.223 (ce) 0.223 (d) fry) = 


l<y<e 


Aa, ot 
3.4.13. fi(y)= ay + 


Section 3.5 
—0.144668 


3.5.1, 
3.5.3. 
3.5.5. 
3.5.7. 
3.5.9. 
3.5.11 


3.5.15. 
3.5.17. 
3.5.19, 
3.5.21. 


3.5.23 


$28,200 


$227.58 
15 


9/4 years 


2 A/a 


10/3 
$10.95 
91/36 
. 5.8125 


j=0 


t+yt+ 


3.5.13. $E(X) = \sum_{k=0}^{200} k \binom{200}{k} (0.80)^k (0.20)^{200-k}$ 
$E(X) = np = 200(0.80) = 160$ 
$307,421.92 


3.5.27. (a) (0.5)=1 (b) 
3.5.29. E(Y) = $132 
3.5.31. $50,000 


3.5.33. Class average = 53.3, so the professor’s “curve” 
did not work. 


3.5.35. 16.33 


Section 3.6 


3.6.1. 12/25 
3.6.3. 0.748 
3.6.5. 3/80 
3.6.7. 1.115 
3.6.9. Johnny should pick (a + b)/2 
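3.6.11. $E(Y) = \int_0^\infty y \lambda e^{-\lambda y}\,dy = 1/\lambda$. $E(Y^2) = \int_0^\infty y^2 \lambda e^{-\lambda y}\,dy = 
2/\lambda^2$, using integration by parts. Thus, $\text{Var}(Y) = 2/\lambda^2 - 
(1/\lambda)^2 = 1/\lambda^2$. 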
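
The result of 3.6.11, that an exponential random variable with rate $\lambda$ has variance $1/\lambda^2$, can be confirmed numerically. The sketch below picks $\lambda = 2$ purely for illustration and replaces the integration by parts with Simpson's rule on a truncated range; the helper name and the truncation point are assumptions of this example.

```python
import math


def simpson(f, a, b, steps=20000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


lam = 2.0            # illustrative rate, not taken from the text
upper = 60.0 / lam   # the density is negligible beyond this point

mean = simpson(lambda y: y * lam * math.exp(-lam * y), 0.0, upper)
second_moment = simpson(lambda y: y * y * lam * math.exp(-lam * y), 0.0, upper)
variance = second_moment - mean**2
print(round(mean, 6), round(second_moment, 6), round(variance, 6))
```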
3.6.13. $E[(X - a)^2] = \text{Var}(X) + (\mu - a)^2$ since $E(X - \mu) = 0$. 
This is minimized when $a = \mu$, so the minimum of $g(a)$ is 
$\text{Var}(X)$. 
3.6.15. 8.7°C 

3.6.17. (a) $f_Y(y) = \frac{1}{b-a} f_U\left(\frac{y-a}{b-a}\right) = \frac{1}{b-a}$. The interval 
where $Y$ is non-zero is $(b-a)(0) + a \le y \le (b-a)(1) + a$, or 
equivalently $a \le y \le b$ 
(b) $\text{Var}(Y) = \text{Var}[(b-a)U + a] = (b-a)^2\,\text{Var}(U) = (b-a)^2/12$ 


3.6.19. 1/7 


r+1’ 
3.6.21. 9/5 

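3.6.23. Let $E(X) = \mu$ and $\text{Var}(X) = \sigma^2$. Then 
$E(aX + b) = a\mu + b$ and $\text{Var}(aX + b) = a^2\sigma^2$. 
Thus, the standard deviation of $aX + b$ is $a\sigma$ and 

$\gamma_1(aX + b) = \frac{E[((aX + b) - (a\mu + b))^3]}{(a\sigma)^3} = \frac{a^3 E[(X - \mu)^3]}{a^3\sigma^3} = \frac{E[(X - \mu)^3]}{\sigma^3} = \gamma_1(X)$ 

The demonstration for $\gamma_2$ is similar. 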
3.6.25. (a) c=5 (b) Highest integral moment = 4 


Section 3.7 


3.7.1. 1/10 
3.7.3. 2 


Obes) 
3.7.5. P(X=x,Y=y)= XJ \yJ/\3-x—-y 


$0 \le x \le 3$, $0 \le y \le 2$, $x + y \le 3$ 

3.7.7. 13/50 

3.7.9. p.(0) = 16/36 p,(1)=16/36  p,(2) = 4/36 
3.7.11. 1/2 


3.7.13. 19/24 
3.7.15. 3/4 
3.7.17. $p_X(0) = 1/2$, $p_X(1) = 1/2$ 
$p_Y(0) = 1/8$, $p_Y(1) = 3/8$, $p_Y(2) = 3/8$, $p_Y(3) = 1/8$ 


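3.7.19. (a) $f_X(x) = 1/2$, $0 \le x \le 2$; $f_Y(y) = 1$, $0 \le y \le 1$ 

(b) $f_X(x) = 1/2$, $0 \le x \le 2$; $f_Y(y) = 3y^2$, $0 \le y \le 1$ 

(c) $f_X(x) = \frac{2}{3}(x + 1)$, $0 \le x \le 1$; $f_Y(y) = \frac{4}{3}y + \frac{1}{3}$, $0 \le y \le 1$ 

(d) $f_X(x) = x + \frac{1}{2}$, $0 \le x \le 1$; $f_Y(y) = y + \frac{1}{2}$, $0 \le y \le 1$ 

(e) $f_X(x) = 2x$, $0 \le x \le 1$; $f_Y(y) = 2y$, $0 \le y \le 1$ 

(f) $f_X(x) = xe^{-x}$, $0 \le x$; $f_Y(y) = ye^{-y}$, $0 \le y$ 

(g) $f_X(x) = \left(\frac{1}{1 + x}\right)^2$, $0 \le x$; $f_Y(y) = e^{-y}$, $0 \le y$ 


3.7.21. $f_X(x) = 3 - 6x + 3x^2$, $0 \le x \le 1$ 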

3.7.23. X is binomial with n = 4 and p = 1/2. Similarly, Y is 
binomial with n = 4 and p= 1/3. 

3.7.25. (a) {(H, 1), (H, 2), (H, 3), (H, 4), (H, 5), (H, 6), (T, 1), 
(T, 2), (T, 3), (T, 4), (T, 5), (T, 6)} (b) 4/12 

3.7.27. (a) Fyy(u,v)=4uv?, O<u<2,0<v<1 

(b) Fyy(u,v)=twvtiuv’, O<u<1,0<v<1 

(c) Fyyu,vy=wv?, O<u<1,0<v<1 

3.7.29. $f_{X,Y}(x, y) = 1$, $0 \le x \le 1$, $0 \le y \le 1$ 

The graph of $f_{X,Y}(x, y)$ is a plane of height one over the unit 
square. 

3.7.31. 11/32 

3.7.33. 0.015 

3.7.35. 25/576 

3.7.37. $f_{W,X}(w, x) = 4wx$, $0 \le w \le 1$, $0 \le x \le 1$ 

$P(0 \le W \le 1/2,\ 1/2 \le X \le 1) = 3/16$ 

3.7.39. $f_X(x) = \lambda e^{-\lambda x}$, $0 \le x$ and $f_Y(y) = \lambda e^{-\lambda y}$, $0 \le y$ 

3.7.41. Note that $P(Y \ge 3/4) \ne 0$. Similarly $P(X \ge 3/4) \ne 0$. 
However, $(X \ge 3/4) \cap (Y \ge 3/4)$ is in the region where the 
density is 0. Thus, $P((X \ge 3/4) \cap (Y \ge 3/4))$ is zero, but the 
product $P(X \ge 3/4)P(Y \ge 3/4)$ is not zero. 

3.7.43. 2/5 

3.7.45. 1/12 

3.7.47. $P(0 \le X \le 1/2,\ 0 \le Y \le 1/2) = 5/32 \ne (3/8)(1/2) = 
P(0 \le X \le 1/2)P(0 \le Y \le 1/2)$ 

3.7.49. Let $K$ be the region of the plane where $f_{X,Y} \ne 0$. 
If $K$ is not a rectangle with sides parallel to the coordinate 
axes, there exists a rectangle $A = \{(x, y) \mid a \le x \le b, c \le y \le d\}$ 
with $A \cap K = \emptyset$, but for $A_1 = \{(x, y) \mid a \le x \le b, \text{all } y\}$ and 
$A_2 = \{(x, y) \mid \text{all } x, c \le y \le d\}$, $A_1 \cap K \ne \emptyset$ and $A_2 \cap K \ne \emptyset$. Then 
$P(A) = 0$, but $P(A_1) \ne 0$ and $P(A_2) \ne 0$. But $A = A_1 \cap A_2$, so 
$P(A_1 \cap A_2) \ne P(A_1)P(A_2)$. 

3.7.51. (a) 1/16 (b) 0.206 

(c) $f_{X_1,X_2,X_3,X_4}(x_1, x_2, x_3, x_4) = 256(x_1 x_2 x_3 x_4)^3$, where $0 \le x_1, x_2, x_3, x_4 \le 1$ (d) $F_{X_2,X_3}(x_2, x_3) = x_2^4 x_3^4$, $0 \le x_2, x_3 \le 1$ 


Section 3.8 


3.8.1. (a) $p_{X+Y}(w) = e^{-(\lambda+\mu)} \frac{(\lambda+\mu)^w}{w!}$, $w = 0, 1, 2, \ldots$, so 
$X + Y$ does belong to the same family. 

(b) $p_{X+Y}(w) = (w - 1)(1 - p)^{w-2}p^2$, $w = 2, 3, 4, \ldots$ 

$X + Y$ does not have the same form of pdf as $X$ and $Y$, 
but Section 4.5 will show that they all belong to the same 
family: the negative binomial. 


483 a O0<w<l 

8.3. friy(W)= 

Pos Jaw 1<w<ed 

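3.8.5. $F_W(w) = P(W \le w) = P(Y^2 \le w) = P(Y \le \sqrt{w}) = F_Y(\sqrt{w})$ 

$f_W(w) = F_W'(w) = \frac{d}{dw} F_Y(\sqrt{w}) = \frac{1}{2\sqrt{w}} f_Y(\sqrt{w})$ 


3.8.7. $3(1 - \sqrt{w})$, $0 \le w \le 1$ 


3.8.9. (a) $f_W(w) = -\ln w$, $0 < w < 1$ 
(b) $f_W(w) = -4w \ln w$, $0 < w < 1$ 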

2 
38. fy(w) = 7. 


Section 3.9 

394, et) 

393, 24H! 
9 18 6 


3.9.5. If and only if $\sum_{i=1}^{n} a_i = 1$ 

3.9.7. (a) $E(X_i)$ is the probability that the $i$-th ball drawn is 
red, $1 \le i \le n$. Draw the balls in order without replacement, 
but do not note the colors. Then look at the $i$-th ball first. The 
probability that it is red is surely independent of when it is 
drawn. Thus, all of these expected values are the same and 
$= r/(r + w)$. 
(b) Let $X$ be the number of red balls drawn. Then $X = \sum_{i=1}^{n} X_i$ 
and $E(X) = \sum_{i=1}^{n} E(X_i) = nr/(r + w)$. 
i=l 
3.9.9. 7.5 
3.9.11. 1/8 
3.9.13. 105/72 
3.9.15. $E(X) = E(Y) = E(XY) = 0$. Then $\text{Cov}(X, Y) = 0$. But 
$X$ and $Y$ are functionally dependent, $Y = \sqrt{1 - X^2}$, so they are 
probabilistically dependent. 


3.9.17. 2/22 
3.9.19. 17/324 
3.9.21. $6750, $373,500 
3.9.23. $\sigma \le 0.163$ 


Section 3.10 


3.10.1. 5/16 

3.10.3. 0.64 

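3.10.5. $P(Y_{\min} > m) = P(Y_1, \ldots, Y_n > m) = \left(\frac{1}{2}\right)^n$ 
$P(Y_{\max} > m) = 1 - P(Y_{\max} < m) = 1 - P(Y_1, \ldots, Y_n < m)$ 
$= 1 - P(Y_1 < m) \cdots P(Y_n < m) = 1 - \left(\frac{1}{2}\right)^n$ 

If $n \ge 2$, the latter probability is greater. 


3.10.7. 0.200 

3.10.9. $P(Y_{\min} > 20) = (1/2)^n$ 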
3.10.11. 0.725; 0.951 
3.10.15. 1/2 


Section 3.11 


Pxy(x,y) x +yt+xy 
Px(x) 3+5x 


eee) 
S413 Ge 


3.11.5. (a) k= 1/36 
x+1 


3.1L. py (y) = 


y=0,1,...,3-x 


b ,(1) = ——_., =1;2;3 
(b) py (1) wee. x 

XV+XZ+ YZ 
3.11.7. (x,y) = ———_.—., = 1,2. =11,°2 

Prsie(®sy) 9+ 122 : ee 
z=1,2 
31113. fx()=——>, O<y<l 
MoE s 

3.11.15. $f_Y(y) = \tfrac{1}{3}(2y + 2)$, $0 \le y \le 1$
3.11.17. 2/3
3.11.19. $f_{X_1,X_2,X_3|x_4,x_5}(x_1, x_2, x_3) = 8x_1x_2x_3$, $0 \le x_1, x_2, x_3 \le 1$


Note: the five random variables are independent, so the 
conditional pdf’s are just the marginal pdf's. 


Section 3.12

3.12.1. $M_Y(t) = E(e^{tY}) = \sum_{k=0}^{n-1} e^{tk}p_Y(k) = \sum_{k=0}^{n-1} e^{tk}\cdot\dfrac{1}{n} = \dfrac{1}{n}\sum_{k=0}^{n-1}(e^t)^k = \dfrac{1 - e^{nt}}{n(1 - e^t)}$


1 
3.12.3. 300 (2+e°)” 
3.12.5. (a) Normal with $\mu = 0$ and $\sigma^2 = 12$
(b) Exponential with $\lambda = 2$
(c) Binomial with n =4 and p= 1/2 
(d) Geometric with p =0.3 


3.12.7. My(t) =e) 


3.12.9. 0 
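3.12.11. $M_Y^{(1)}(t) = \dfrac{d}{dt}e^{at + b^2t^2/2} = (a + b^2t)e^{at + b^2t^2/2}$, so $M_Y^{(1)}(0) = a$.

$M_Y^{(2)}(t) = (a + b^2t)^2e^{at + b^2t^2/2} + b^2e^{at + b^2t^2/2}$, so
$M_Y^{(2)}(0) = a^2 + b^2$. Then $\text{Var}(Y) = (a^2 + b^2) - a^2 = b^2$.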
3.12.13. 9 


b 
3.12.15 E(Y)=<+ 


2 


2 

3.12.17. My(t)= (<7) 
1-7/4 

3.12.19. (a) True 
(b) False 
(c) True 
3.12.21. $\bar{Y}$ is normally distributed with mean $\mu$ and variance
$\sigma^2/n$.
3.12.23. (a) $M_W(t) = M_{3X}(t) = M_X(3t) = e^{-\lambda + \lambda e^{3t}}$. This last
term is not the moment-generating function of a Poisson
random variable, so W is not Poisson.
(b) $M_W(t) = M_{3X+1}(t) = e^tM_X(3t) = e^te^{-\lambda + \lambda e^{3t}}$. This last term
is not the moment-generating function of a Poisson random
variable, so W is not Poisson.


CHAPTER 4 
Section 4.2 


4.2.1. Binomial answer: 0.158; Poisson approximation: 0.158. 
The agreement is not surprising because n (= 6000) is so large 
and p (= 1/3250) is so small. 

4.2.3. 0.602 

4.2.5. For both the binomial formula and the Poisson approx- 
imation, P(X > 1)=0.10. The exact model that applies here is 
the hypergeometric, rather than the binomial, because p = P 
(ith item must be checked) is a function of the previous i — 1 
items purchased. However, the variation in p is likely to be 
so small that the binomial and hypergeometric distributions 
in this case are essentially the same. 

4.2.7. 0.122 

4.2.9. 6.9 x 107” 

4.2.11. The Poisson model $p_X(k) = e^{-0.435}(0.435)^k/k!$, $k = 0, 1, \ldots$
fits the data fairly well. The expected frequencies corresponding
to k = 0, 1, 2, and 3+ are 230.3, 100.4, 21.7, and
3.6, respectively.

4.2.13. The model $p_X(k) = e^{-0.363}\dfrac{(0.363)^k}{k!}$ fits the data well
if we follow the usual statistical practice of collapsing low
frequency categories, in this case k = 2, 3, 4.


Number of        Observed                 Expected
Countries, k     Frequency     p_X(k)     Frequency
0                82            0.696      78.6
1                25            0.252      28.5
2+                6            0.052       5.9


The level of agreement between the observed and expected 
frequencies suggests that the Poisson is a good model for 


these data. 

4.2.15. If the mites exhibit any sort of "contagion" effect, the
independence assumption implicit in the Poisson model will
be violated. Here, $\bar{x} = \tfrac{1}{100}[55(0) + 20(1) + \cdots + 1(7)] = 0.81$,
but $p_X(k) = e^{-0.81}(0.81)^k/k!$, $k = 0, 1, \ldots$ does not adequately
approximate the infestation distribution.


No. of Infestations, k    Frequency    Proportion    p_X(k)
0                         55           0.55          0.4449
1                         20           0.20          0.3603
2                         21           0.21          0.1459
3                          1           0.01          0.0394
4                          1           0.01          0.0080
5                          1           0.01          0.0013
6                          0           0             0.0002
7+                         1           0.01          0.0000
                                       1.00          1.00
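
As a rough check on 4.2.15, the fitted Poisson column can be recomputed from the frequency column. The sketch below assumes, as the answer's own arithmetic does, that the 7+ category is counted as k = 7 when forming the sample mean; the function and variable names are illustrative only.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability e^{-lam} * lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Observed frequencies for k = 0, 1, ..., 6 and the 7+ category (treated as k = 7).
freqs = [55, 20, 21, 1, 1, 1, 0, 1]
n = sum(freqs)
xbar = sum(k * f for k, f in enumerate(freqs)) / n

# Fitted probabilities for k = 0..6, with the remaining mass assigned to 7+.
fitted = [poisson_pmf(k, xbar) for k in range(7)]
fitted.append(1 - sum(fitted))

for k, (f, p) in enumerate(zip(freqs, fitted)):
    print(k, f / n, round(p, 4))
```

Comparing the printed proportions with the fitted values shows the excess of 0's and 2's that the answer attributes to contagion.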


4.2.17. 0.826 
4.2.19. 0.762 


4.2.21. (a) 0.076 

(b) No. P(4 accidents occur during next two weeks) =
P(X = 4) · P(X = 0) + P(X = 3) · P(X = 1) + P(X = 2) ·
P(X = 2) + P(X = 1) · P(X = 3) + P(X = 0) · P(X = 4).


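4.2.23. $P(X \text{ is even}) = \sum_{k=0}^{\infty}\dfrac{e^{-\lambda}\lambda^{2k}}{(2k)!} = e^{-\lambda}\left(1 + \dfrac{\lambda^2}{2!} + \dfrac{\lambda^4}{4!} + \dfrac{\lambda^6}{6!} + \cdots\right)$

$= e^{-\lambda}\cosh\lambda = e^{-\lambda}\left(\dfrac{e^{\lambda} + e^{-\lambda}}{2}\right) = \dfrac{1}{2}\left(1 + e^{-2\lambda}\right)$.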


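4.2.25. $P(X_2 = k) = \sum_{x_1=k}^{\infty}\binom{x_1}{k}p^k(1-p)^{x_1-k}\,\dfrac{e^{-\lambda}\lambda^{x_1}}{x_1!}$. Let
$y = x_1 - k$. Then

$P(X_2 = k) = \sum_{y=0}^{\infty}\binom{y+k}{k}p^k(1-p)^y\,\dfrac{e^{-\lambda}\lambda^{y+k}}{(y+k)!} = \dfrac{e^{-\lambda}(\lambda p)^k}{k!}\sum_{y=0}^{\infty}\dfrac{[\lambda(1-p)]^y}{y!} = \dfrac{e^{-\lambda}(\lambda p)^k}{k!}\cdot e^{\lambda(1-p)}$

$= \dfrac{e^{-\lambda p}(\lambda p)^k}{k!}$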
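
The conclusion of 4.2.25, that a binomially thinned Poisson count is again Poisson with parameter $\lambda p$, can be checked numerically. This is only a sketch: the truncation point `terms` and the parameter values are arbitrary choices, not values from the question.

```python
import math

def poisson_pmf(k, lam):
    """Poisson pmf computed on the log scale to avoid large factorials."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def thinned_pmf(k, lam, p, terms=150):
    """P(X2 = k) when X1 ~ Poisson(lam) and X2 | X1 = x1 ~ binomial(x1, p).

    The infinite sum over x1 is truncated after `terms` terms.
    """
    total = 0.0
    for x1 in range(k, k + terms):
        binom = math.comb(x1, k) * p**k * (1 - p) ** (x1 - k)
        total += binom * poisson_pmf(x1, lam)
    return total

lam, p = 3.0, 0.4
for k in range(5):
    print(k, thinned_pmf(k, lam, p), poisson_pmf(k, lam * p))
```

The two printed columns should agree to within floating-point error.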
4.2.27. 0.50 
4.2.29. 28 
Section 4.3 
4.3.1. (a) 0.5782 (b) 0.8264 (c) 0.9306 (d) 0.0000


ato | 2 
4.3.3. (a) Botharethesame (b / —e* dz 
(a) (b) iw or 


Tv 
4.3.5. (a) -0.44 (b) 0.76 (c) 0.41 (d) 1.28
(e) 0.95
4.3.7. 0.0655

4.3.9. (a) 0.0053 (b) 0.0197 

4.3.11. P(X > 344) = P(Z > 13.25) = 0.0000, which strongly 
discredits the hypothesis that people die randomly with 
respect to their birthdays. 

4.3.13. The normal approximation does not apply because 
the needed condition n > 9p/(1 — p) = 9(0.7)/0.3 = 21 does 
not hold. 

4.3.15. 0.5646 

For binomial data, the central limit theorem and DeMoivre- 
Laplace approximations differ only if the continuity correc- 
tion is used in the DeMoivre-Laplace approximation. 

4.3.17. 0.6808 

4.3.19. 0.0694 

4.3.21. No, only 84% of drivers are likely to get at least 
25,000 miles on the tires. 

4.3.23. 0.0228 
4.3.25. (a) 6.68%; 
4.3.27. 434 

4.3.29. 29.85
4.3.31. 0.0062. The “0.075%” driver should ask to take the 
test twice; the “0.09%” driver has a greater chance of not 
being charged by taking the test only once. As n, the num- 
ber of times the test is taken, increases, the precision of the 
average reading increases. It is to the sober driver’s advan- 
tage to have a reading as precise as possible; the opposite is 
true for the drunk driver. 

4.3.33. 0.23 

4.3.35. $\sigma = 0.22$ ohms


15.87% 


Section 4.4 


4.4.1. 0.343 
4.4.3. No, the expected frequencies ($= 50 \cdot p_X(k)$) differ considerably
from the observed frequencies, especially for small
values of k. The observed number of 1’s, for example, is 4, 
while the expected number is 12.5. 

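4.4.5. $F_X(t) = P(X \le t) = p\sum_{s=0}^{[t]-1}(1-p)^s$. But $\sum_{s=0}^{[t]-1}(1-p)^s = \dfrac{1 - (1-p)^{[t]}}{1 - (1-p)} = \dfrac{1 - (1-p)^{[t]}}{p}$, and the result follows.


4.4.7. $P(n \le Y \le n+1) = \int_n^{n+1}\lambda e^{-\lambda y}\,dy = \left.(1 - e^{-\lambda y})\right|_n^{n+1}$

$= e^{-\lambda n} - e^{-\lambda(n+1)} = e^{-\lambda n}(1 - e^{-\lambda})$. Setting $p = 1 - e^{-\lambda}$ gives $P(n \le Y \le n+1) = (1-p)^np$.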
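
The identity in 4.4.7 says that the unit-interval probabilities of an exponential random variable follow a geometric pattern. A minimal numerical sketch, with an arbitrary illustrative rate, is:

```python
import math

def interval_prob(n, lam):
    """P(n <= Y <= n+1) for Y exponential with rate lam, via the cdf."""
    cdf = lambda y: 1 - math.exp(-lam * y)
    return cdf(n + 1) - cdf(n)

def geometric_form(n, lam):
    """(1 - p)^n * p with p = 1 - e^{-lam}."""
    p = 1 - math.exp(-lam)
    return (1 - p) ** n * p

lam = 0.7  # illustrative rate, not taken from the question
for n in range(5):
    print(n, interval_prob(n, lam), geometric_form(n, lam))
```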


Section 4.5 


4.5.1. 0.029 

4.5.3. Probably not. The presumed model, $p_X(k) = \binom{k-1}{1}\left(\tfrac{1}{2}\right)^2\left(\tfrac{1}{2}\right)^{k-2}$,
$k = 2, 3, \ldots$ fits the data almost perfectly, as
the table shows. Agreement this good is often an indication
that the data have been fabricated.

k  p_X(k)  Obs. Freq.  Exp. Freq.
2 1/4 24 25 
3 2/8 26 25 
4 3/16 19 19 
5 4/32 13 12 
6 5/64 8 8 
7 6/128 5 5 
8 7/256 3 3 
9 8/512 1 2 
10 9/1024 1 1 


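4.5.5. $E(X) = \sum_{k=r}^{\infty} k\binom{k-1}{r-1}p^r(1-p)^{k-r} = \dfrac{r}{p}\sum_{k=r}^{\infty}\binom{k}{r}p^{r+1}(1-p)^{k-r} = \dfrac{r}{p}$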


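4.5.7. The given X = Y - r, where Y has the negative binomial
pdf as described in Theorem 4.5.1. Then $E(X) = \dfrac{r}{p} - r = \dfrac{r(1-p)}{p}$
and $\text{Var}(X) = \text{Var}(Y) = \dfrac{r(1-p)}{p^2}$.

4.5.9. $M_X^{(1)}(t) = r\left[\dfrac{pe^t}{1 - (1-p)e^t}\right]^{r-1}\left\{pe^t[1 - (1-p)e^t]^{-2}(1-p)e^t + [1 - (1-p)e^t]^{-1}pe^t\right\}$. When $t = 0$, $M_X^{(1)}(0) = E(X) = r\left[\dfrac{p(1-p)}{p^2} + \dfrac{p}{p}\right] = \dfrac{r}{p}$.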
Section 4.6 
4.6.1. $f_Y(y) = \dfrac{(0.001)^3}{2}\,y^2e^{-0.001y}$, $0 \le y$


4.6.3. If $E(Y) = \dfrac{r}{\lambda} = 1.5$ and $\text{Var}(Y) = \dfrac{r}{\lambda^2} = 0.75$, then $r = 3$
and $\lambda = 2$, which makes $f_Y(y) = 4y^2e^{-2y}$, $y > 0$. Then $P(1.0 \le Y_i \le 2.5) = \int_{1.0}^{2.5} 4y^2e^{-2y}\,dy = 0.55$. Let X = number of $Y_i$'s in
the interval (1.0, 2.5). Since X is a binomial random variable
with n = 100 and p = 0.55, E(X) = np = 55.
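
The probability 0.55 in 4.6.3 can be reproduced by integrating the gamma pdf numerically. The sketch below uses a hand-written composite Simpson's rule; the helper names are illustrative and not from the text.

```python
import math

def gamma_pdf(y):
    """Gamma pdf with r = 3 and lambda = 2: f(y) = 4 y^2 e^{-2y}."""
    return 4 * y**2 * math.exp(-2 * y)

def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

p = simpson(gamma_pdf, 1.0, 2.5)
expected_count = 100 * p
print(round(p, 2), round(expected_count))
```

The closed form $5e^{-2} - 18.5e^{-5}$, obtained by integrating by parts, gives the same value.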
4.6.5. Setting the first derivative of $f_Y(y)$ equal to 0 gives

$\dfrac{\lambda^r}{\Gamma(r)}\,e^{-\lambda y}y^{r-2}\left[-\lambda y + (r - 1)\right] = 0$

which implies that $(r - 1) - \lambda y = 0$, so $y = \dfrac{r-1}{\lambda}$ is a mode. Its
uniqueness follows from the fact that the second derivative
of $f_Y(y)$ is negative for all other y for which $f_Y(y)$ is defined.
¥ ONS fi G/)= A 2)" eho) 
‘ i T(r) 


r-l1,-y 


“Tw” 


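4.6.7. $\Gamma\left(\tfrac{7}{2}\right) = \tfrac{5}{2}\Gamma\left(\tfrac{5}{2}\right) = \tfrac{5}{2}\cdot\tfrac{3}{2}\Gamma\left(\tfrac{3}{2}\right) = \tfrac{5}{2}\cdot\tfrac{3}{2}\cdot\tfrac{1}{2}\Gamma\left(\tfrac{1}{2}\right) = \tfrac{15}{8}\sqrt{\pi}$ by Theorem
4.6.2 (2), and $\Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}$ by Question 4.6.6.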

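4.6.9. Write the gamma moment-generating function
in the form $M_Y(t) = (1 - t/\lambda)^{-r}$. Then $M_Y^{(1)}(t) = -r(1 - t/\lambda)^{-r-1}(-1/\lambda) = (r/\lambda)(1 - t/\lambda)^{-r-1}$ and
$M_Y^{(2)}(t) = (r/\lambda)(-r - 1)(1 - t/\lambda)^{-r-2}(-1/\lambda) = \dfrac{r(r+1)}{\lambda^2}(1 - t/\lambda)^{-r-2}$.

Therefore, $E(Y) = M_Y^{(1)}(0) = \dfrac{r}{\lambda}$ and $\text{Var}(Y) = M_Y^{(2)}(0) - [M_Y^{(1)}(0)]^2 = \dfrac{r(r+1)}{\lambda^2} - \dfrac{r^2}{\lambda^2} = \dfrac{r}{\lambda^2}$.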
CHAPTER 5 
Section 5.2 
5.2.1. 5/8 
5.2.3. 0.122 
5.2.5. 0.733 
5.2.7. 8.00 
5.2.9. (a) $\lambda_e = [0(6) + 1(19) + 2(12) + 3(13) + 4(9)]/59 = 2.00$

Number of     Observed     Expected
No-hitters    Frequency    Frequency
0 6 8.0 
1 19 16.0 
2 12 16.0 
3 13 10.6 
4 9 8.4 


(b) The agreement is reasonably good considering the num- 
ber of changes in baseball over the 59 years—most notably 
the change in the height of the pitcher’s mound. 


5.2.11. $y_{\min}$


5.2.13. a 


—25Ink +) Iny,; 


i=l 


5.2.15. 0,=02 =! (y, — uy 


5.2.17. 


5.2.19. 1/5 
5.2.21. 3/(¥—k) 


pe : 
Dia 


5.2.25. $E(X) = 1/p$ and $p_e = \dfrac{1}{\bar{x}}$. For the given data, $p_e = 0.479$.

The expected frequencies are:


5.2.23. 


X      Observed Frequency    Expected Frequency

1      132                   119.8
2       52                    62.4
3       34                    32.5
4        9                    16.9
5        7                     8.8
6        5                     4.6
7        5                     2.4
≥8       6                     2.6


Section 5.3 


5.3.1. Confidence interval is (103.7, 112.1). 


5.3.3. The confidence interval is (64.432, 77.234). Since 80
does not fall within the confidence interval, the claim that men and
women metabolize methylmercury at the same rate is not
believable.


5.3.5. 336 
5.3.7. 0.501 


5.3.9. The interval given is correctly calculated. However, 
the data do not appear to be normal, so claiming that it is a 
95% confidence interval would not be correct. 


5.3.9. (0.316, 0.396) 
5.3.11. (0.254, 0.300) 


5.3.13. Since 0.54 does not fall in the confidence interval of 
(0.61, 0.65), the increase could be considered significant. 


5.3.15. 16,641 


5.3.17. Both intervals have confidence level approximately 
50%. 


5.3.19. The margin of error is correct at the 95% level. The 
confidence interval is (0.559, 0.621). 
5.3.21. In Definition 5.3.1, substitute d= SYN < a 

n 
5.3.23. For margin of error 0.06, n = 267. For margin of error
0.03, n = 1068.


5.3.25. The first case requires n = 421; the second, n = 479. 
5.3.27. 1024 


Section 5.4 


5.4.1. 2/10 
5.4.3. 0.1841 


5.4.5. (a) $E(\bar{X}) = E\left(\dfrac{1}{n}\sum_{i=1}^{n}X_i\right) = \dfrac{1}{n}\sum_{i=1}^{n}E(X_i) = \dfrac{1}{n}\sum_{i=1}^{n}\lambda = \lambda$
(b) In general, the sample mean is an unbiased estimator of
the mean $\mu$.


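5.4.7. By Theorem 3.10.1, $f_{Y_{\min}}(y) = ne^{-n(y-\theta)}$, $\theta \le y$. Then

$E(Y_{\min}) = \int_{\theta}^{\infty} y\cdot ne^{-n(y-\theta)}\,dy = \int_0^{\infty}(u + \theta)\cdot ne^{-nu}\,du = \int_0^{\infty}u\cdot ne^{-nu}\,du + \theta\int_0^{\infty}ne^{-nu}\,du = \dfrac{1}{n} + \theta$

and $E\left(Y_{\min} - \dfrac{1}{n}\right) = \theta$.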
5.4.9. 1/2 
5.4.11. $E(W^2) = \text{Var}(W) + E(W)^2 = \text{Var}(W) + \theta^2$. Thus, $W^2$ is
unbiased only if Var(W) = 0, which in essence means that W
is constant.

5.4.13. The median of $\hat{\theta}$ is $\dfrac{n+1}{n\sqrt[n]{2}}\,\theta$, which is unbiased only
if n = 1.

5.4.15. $E(\bar{W}^2) = \text{Var}(\bar{W}) + E(\bar{W})^2 = \dfrac{\sigma^2}{n} + \mu^2$, so $\lim_{n\to\infty}E(\bar{W}^2) = \lim_{n\to\infty}\left(\dfrac{\sigma^2}{n} + \mu^2\right) = \mu^2$
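5.4.17. (a) $E(\hat{p}_1) = E(X_1) = p$. $E(\hat{p}_2) = E\left(\dfrac{X}{n}\right) = \dfrac{1}{n}E(X) = \dfrac{1}{n}np = p$, so both $\hat{p}_1$ and $\hat{p}_2$ are unbiased estimators of p.

(b) $\text{Var}(\hat{p}_1) = p(1 - p)$; $\text{Var}(\hat{p}_2) = p(1 - p)/n$, so $\text{Var}(\hat{p}_2)$ is
smaller by a factor of n.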

5.4.19. (a) See the solution to Question 5.4.14. 
(b) $\text{Var}(\hat{\theta}_1) = \text{Var}(Y_1) = \theta^2$, since $Y_1$ is a gamma variable with
parameters 1 and $1/\theta$.

$\text{Var}(\hat{\theta}_2) = \text{Var}(\bar{Y}) = \theta^2/n$.

From the solution to Question 5.4.14, it follows that 
nY pin is A gamma variable with parameters 1 and 6*/n*, so 
Var(63) = Var(1Y min) = 07 /n?. 

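(c) $\text{Var}(\hat{\theta}_3)/\text{Var}(\hat{\theta}_1) = (\theta^2/n^2)/\theta^2 = 1/n^2$
$\text{Var}(\hat{\theta}_3)/\text{Var}(\hat{\theta}_2) = (\theta^2/n^2)/(\theta^2/n) = 1/n$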


‘“ 1 g? 
5A21: Var(d,) =Var( “t*¥\a 
n n(n +2) 
Vai Salas ino=—— 
ar(0,) = Var((n nin) = 
: (n +2) 


a 2 
Var (2) /Var(6, = a +2) ao / ae +2) = 
Section 5.5 


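5.5.1. The Cramér-Rao bound is $\theta^2/n$. $\text{Var}(\hat{\theta}) = \text{Var}(\bar{Y}) = \text{Var}(Y)/n = \theta^2/n$, so $\hat{\theta}$ is a best estimator.
5.5.3. The Cramér-Rao bound is $\sigma^2/n$. $\text{Var}(\hat{\mu}) = \text{Var}(\bar{Y}) = \text{Var}(Y)/n = \sigma^2/n$, so $\hat{\mu}$ is an efficient estimator.


5.5.5. The Cramér-Rao bound is $\dfrac{(\theta - 1)\theta}{n}$. $\text{Var}(\hat{\theta}) = \text{Var}(\bar{X}) = \text{Var}(X)/n = \dfrac{(\theta - 1)\theta}{n}$, so $\hat{\theta}$ is an efficient estimator.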
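5.5.7. $E\left[\dfrac{\partial^2}{\partial\theta^2}\ln f_W(W; \theta)\right] = \int_{-\infty}^{\infty}\dfrac{\partial}{\partial\theta}\left(\dfrac{\partial\ln f_W(w; \theta)}{\partial\theta}\right)f_W(w; \theta)\,dw$

$= \int_{-\infty}^{\infty}\dfrac{\partial}{\partial\theta}\left(\dfrac{1}{f_W(w; \theta)}\dfrac{\partial f_W(w; \theta)}{\partial\theta}\right)f_W(w; \theta)\,dw$

$= \int_{-\infty}^{\infty}\left[\dfrac{1}{f_W(w; \theta)}\dfrac{\partial^2 f_W(w; \theta)}{\partial\theta^2} - \dfrac{1}{(f_W(w; \theta))^2}\left(\dfrac{\partial f_W(w; \theta)}{\partial\theta}\right)^2\right]f_W(w; \theta)\,dw$

$= \int_{-\infty}^{\infty}\dfrac{\partial^2 f_W(w; \theta)}{\partial\theta^2}\,dw - \int_{-\infty}^{\infty}\dfrac{1}{(f_W(w; \theta))^2}\left(\dfrac{\partial f_W(w; \theta)}{\partial\theta}\right)^2f_W(w; \theta)\,dw$

$= 0 - \int_{-\infty}^{\infty}\left(\dfrac{\partial\ln f_W(w; \theta)}{\partial\theta}\right)^2f_W(w; \theta)\,dw$


The 0 occurs because $1 = \int_{-\infty}^{\infty}f_W(w; \theta)\,dw$, so

$0 = \dfrac{\partial^2}{\partial\theta^2}\int_{-\infty}^{\infty}f_W(w; \theta)\,dw = \int_{-\infty}^{\infty}\dfrac{\partial^2 f_W(w; \theta)}{\partial\theta^2}\,dw$


The above argument shows that

$E\left[\dfrac{\partial^2\ln f_W(W; \theta)}{\partial\theta^2}\right] = -E\left[\left(\dfrac{\partial\ln f_W(W; \theta)}{\partial\theta}\right)^2\right]$


Multiplying both sides of the equality by n and inverting gives
the desired equality.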
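
The identity in 5.5.7 is stated there for a continuous pdf, but the same argument applies to discrete models. As an illustration only, the sketch below checks it for a Poisson pmf, where both sides should equal $1/\lambda$; the choice of the Poisson model and the truncation point `kmax` are assumptions made for the example.

```python
import math

def poisson_pmf(k, lam):
    """Poisson pmf computed on the log scale."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def information_two_ways(lam, kmax=200):
    """Return (E[(d/dlam ln p)^2], -E[d^2/dlam^2 ln p]) for the Poisson pmf.

    ln p_X(k; lam) = -lam + k ln(lam) - ln(k!), so the first derivative is
    k/lam - 1 and the second derivative is -k/lam^2.
    """
    squared_score = 0.0
    neg_second = 0.0
    for k in range(kmax):
        pk = poisson_pmf(k, lam)
        squared_score += (k / lam - 1) ** 2 * pk
        neg_second += (k / lam**2) * pk
    return squared_score, neg_second

a, b = information_two_ways(2.5)
print(a, b, 1 / 2.5)
```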


Section 5.6 


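5.6.1. $\prod_{i=1}^{n}p_X(k_i; p) = \prod_{i=1}^{n}(1 - p)^{k_i - 1}p = (1 - p)^{\sum_{i=1}^{n}k_i - n}\,p^n$.

Let $g\left(\sum_{i=1}^{n}k_i; p\right) = (1 - p)^{\sum_{i=1}^{n}k_i - n}\,p^n$ and $u(k_1, k_2, \ldots, k_n) = 1$.

By Theorem 5.6.1, $\sum_{i=1}^{n}X_i$ is sufficient.
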
5.6.3. In the discrete case, and for a one-to-one function g,
note that $P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid g(\hat{\theta}) = \theta_e) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid \hat{\theta} = g^{-1}(\theta_e))$.
The right hand term does not depend on $\theta$, because $\hat{\theta}$ is
sufficient.

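5.6.5. The likelihood function is $\prod_{i=1}^{n}\dfrac{1}{(r-1)!\,\theta^r}\,y_i^{r-1}e^{-y_i/\theta} = \dfrac{1}{[(r-1)!]^n\theta^{rn}}\left(\prod_{i=1}^{n}y_i\right)^{r-1}e^{-\frac{1}{\theta}\sum_{i=1}^{n}y_i}$, so $\sum_{i=1}^{n}Y_i$ is a sufficient statistic for $\theta$.

So also is $\tfrac{1}{r}\bar{Y}$. (See Question 5.6.3.)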


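5.6.7. (a) Write the pdf in the form $f_Y(y) = e^{-(y-\theta)}\cdot I_{[\theta,\infty)}(y)$,
where $I_{[\theta,\infty)}(y)$ is the indicator function introduced in
Example 5.6.2. Then the likelihood function is

$L(\theta) = \prod_{i=1}^{n}e^{-(y_i-\theta)}\cdot I_{[\theta,\infty)}(y_i) = e^{-\sum_{i=1}^{n}y_i}\,e^{n\theta}\prod_{i=1}^{n}I_{[\theta,\infty)}(y_i)$

But $\prod_{i=1}^{n}I_{[\theta,\infty)}(y_i) = I_{[\theta,\infty)}(y_{\min})$, so the likelihood function
factors into

$L(\theta) = \left[e^{-\sum_{i=1}^{n}y_i}\right]\cdot\left[e^{n\theta}\,I_{[\theta,\infty)}(y_{\min})\right]$

Thus the likelihood function decomposes in such a way that
the factor involving $\theta$ only contains the $y_i$'s through $y_{\min}$. By
Theorem 5.6.1, $Y_{\min}$ is sufficient.

(b) For $Y_{\max}$ to be sufficient, the likelihood function given $y_{\max}$
would have to be independent of $\theta$. But the likelihood function is

$\prod_{i=1}^{n}e^{-(y_i-\theta)} = e^{n\theta}e^{-\sum_{i=1}^{n}y_i}$ if $\theta \le y_1, y_2, \ldots, y_n$, and 0 otherwise.

Regardless of the value of $y_{\max}$, the expression for the likelihood
does depend on $\theta$. If any one of the $y_i$, other than $y_{\max}$,
is less than $\theta$, the expression is 0. Otherwise it is non-zero.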


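5.6.9. $\prod_{i=1}^{n}g_W(w_i; \theta) = e^{\left(\sum_{i=1}^{n}K(w_i)\right)p(\theta) + nq(\theta)}\cdot e^{\sum_{i=1}^{n}S(w_i)}$, so
$\sum_{i=1}^{n}K(W_i)$ is a sufficient statistic for $\theta$ by Theorem 5.6.1.

5.6.11. $\theta/(1 + y)^{\theta+1} = e^{[\ln(1+y)](-\theta-1) + \ln\theta}$. Take $K(y) = \ln(1 + y)$,
$p(\theta) = -\theta - 1$, and $q(\theta) = \ln\theta$. Then $\sum_{i=1}^{n}K(Y_i) = \sum_{i=1}^{n}\ln(1 + Y_i)$
is sufficient for $\theta$.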
Section 5.7 

5.7.1. 17 

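5.7.3. (a) $P(Y_1 > 2\lambda) = \int_{2\lambda}^{\infty}\lambda e^{-\lambda y}\,dy = e^{-2\lambda^2}$. Then $P(|Y_1 - \lambda| < \lambda/2) \le 1 - e^{-2\lambda^2} < 1$. Thus, $\lim_{n\to\infty}P(|Y_1 - \lambda| < \lambda/2) < 1$.

(b) $P\left(\sum_{i=1}^{n}Y_i > 2\lambda\right) \ge P(Y_1 > 2\lambda) = e^{-2\lambda^2}$. The proof now
proceeds along the lines of Part (a).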

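5.7.5. $E[(Y_{\max} - \theta)^2] = \int_0^{\theta}(y - \theta)^2\,\dfrac{n}{\theta}\left(\dfrac{y}{\theta}\right)^{n-1}dy = \dfrac{n}{\theta^n}\int_0^{\theta}\left(y^{n+1} - 2\theta y^n + \theta^2y^{n-1}\right)dy$

$= \dfrac{n}{\theta^n}\left(\dfrac{\theta^{n+2}}{n+2} - \dfrac{2\theta^{n+2}}{n+1} + \dfrac{\theta^{n+2}}{n}\right) = \left(\dfrac{n}{n+2} - \dfrac{2n}{n+1} + 1\right)\theta^2$

Then $\lim_{n\to\infty}E[(Y_{\max} - \theta)^2] = \lim_{n\to\infty}\left(\dfrac{n}{n+2} - \dfrac{2n}{n+1} + 1\right)\theta^2 = 0$
and the estimator is squared error consistent.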
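
The coefficient of $\theta^2$ in 5.7.5 simplifies to $2/[(n+1)(n+2)]$, which makes the limit of 0 immediate. That simplification is not written out in the answer; the sketch below checks it with exact rational arithmetic.

```python
from fractions import Fraction

def mse_coefficient(n):
    """Coefficient of theta^2 in E[(Y_max - theta)^2], as derived in 5.7.5."""
    return Fraction(n, n + 2) - Fraction(2 * n, n + 1) + 1

def closed_form(n):
    """Simplified form 2 / ((n + 1)(n + 2))."""
    return Fraction(2, (n + 1) * (n + 2))

for n in (1, 2, 5, 10, 100):
    print(n, mse_coefficient(n), closed_form(n))
```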


Section 5.8 
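5.8.1. The numerator of $g_\Theta(\theta \mid X = k)$ is

$p_X(k \mid \theta)f_\Theta(\theta) = [(1 - \theta)^{k-1}\theta]\,\dfrac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\,\theta^{r-1}(1 - \theta)^{s-1} = \dfrac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\,\theta^r(1 - \theta)^{s+k-2}$


The term $\theta^r(1 - \theta)^{s+k-2}$ is the variable part of the beta distribution
with parameters r + 1 and s + k - 1, so that is the pdf
$g_\Theta(\theta \mid X = k)$.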
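5.8.3. (a) The posterior distribution is a beta pdf with
parameters k + 135 and n - k + 135.
(b) The mean of the Bayes pdf given in part (a) is

$\dfrac{k + 135}{k + 135 + n - k + 135} = \dfrac{k + 135}{n + 270} = \dfrac{n}{n + 270}\left(\dfrac{k}{n}\right) + \dfrac{270}{n + 270}\left(\dfrac{135}{270}\right) = \dfrac{n}{n + 270}\left(\dfrac{k}{n}\right) + \dfrac{270}{n + 270}\left(\dfrac{1}{2}\right)$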


5.8.5. In each case the estimator is biased, since the mean
of the estimator is a weighted average of the unbiased maximum
likelihood estimator and a non-zero constant. However,
in each case, the weighting on the maximum likelihood estimator
tends to 1 as n tends to $\infty$, so these estimators are
asymptotically unbiased.


5.8.7. Since the sum of gamma random variables is gamma, 
then W is gamma with parameters nr and 4. Then go(6|X =k) 


is a gamma pdf with parameters nr +s and }- y; + uw. 


i=l 


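5.8.9. $p_X(k \mid \theta)f_\Theta(\theta) = \binom{n}{k}\dfrac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\,\theta^{k+r-1}(1 - \theta)^{n-k+s-1}$, so

$p_X(k) = \binom{n}{k}\dfrac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\int_0^1\theta^{k+r-1}(1 - \theta)^{n-k+s-1}\,d\theta = \binom{n}{k}\dfrac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\cdot\dfrac{\Gamma(k + r)\Gamma(n - k + s)}{\Gamma(n + r + s)}$

$= \dfrac{n!}{k!(n - k)!}\cdot\dfrac{(r + s - 1)!}{(r - 1)!(s - 1)!}\cdot\dfrac{(k + r - 1)!(n - k + s - 1)!}{(n + r + s - 1)!}$

$= \dfrac{(k + r - 1)!}{k!(r - 1)!}\cdot\dfrac{(n - k + s - 1)!}{(n - k)!(s - 1)!}\cdot\dfrac{n!(r + s - 1)!}{(n + r + s - 1)!}$

$= \binom{k + r - 1}{k}\binom{n - k + s - 1}{n - k}\bigg/\binom{n + r + s - 1}{n}$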

CHAPTER 6 

Section 6.2 

6.2.1. (a) Reject $H_0$ if $\dfrac{\bar{y} - \mu_0}{18/\sqrt{25}} \le -1.41$; z = -1.61; reject $H_0$.

(b) Reject $H_0$ if $\dfrac{\bar{y} - \mu_0}{3.2/\sqrt{16}}$ is either 1) $\le -2.58$ or 2) $\ge 2.58$;
z = 2.75; reject $H_0$.

(c) Reject $H_0$ if $\dfrac{\bar{y} - \mu_0}{4.1/\sqrt{n}} \ge 1.13$; z = 1.17; reject $H_0$.
6.2.3. (a) No (b) Yes 
6.2.5. No 
6.2.7. (a) $H_0$ should be rejected if $\dfrac{\bar{y} - 12.6}{0.4/\sqrt{30}}$ is either (1)
$\le -1.96$ or (2) $\ge 1.96$. But $\bar{y} = 12.76$ and z = 2.19, suggesting
that the machine should be readjusted.

(b) The test assumes that the y;’s constitute a random sample 
from a normal distribution. Graphed, a histogram of the 30 
y,;’s shows a mostly bell-shaped pattern. There is no reason 
to suspect that the normality assumption is not being met. 
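
The arithmetic behind 6.2.7(a) can be sketched as follows. The null value 12.6 is inferred from the reported $\bar{y} = 12.76$, $\sigma = 0.4$, n = 30, and z = 2.19, so treat it as an assumption; the normal cdf is built from `math.erf` rather than taken from a table.

```python
import math

def z_statistic(ybar, mu0, sigma, n):
    """One-sample z ratio (ybar - mu0) / (sigma / sqrt(n))."""
    return (ybar - mu0) / (sigma / math.sqrt(n))

def normal_cdf(z):
    """Standard normal cdf expressed through the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = z_statistic(12.76, 12.6, 0.4, 30)   # mu0 = 12.6 is an inferred value
p_value = 2 * (1 - normal_cdf(abs(z)))  # two-sided P-value
reject = abs(z) >= 1.96                 # decision rule at alpha = 0.05
print(round(z, 2), round(p_value, 4), reject)
```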
6.2.9. P-value = $P(Z \le -0.92) + P(Z \ge 0.92) = 0.3576$; $H_0$
would be rejected if $\alpha$ had been set at any value greater than
or equal to 0.3576.

6.2.11. $H_0$ should be rejected if $\dfrac{\bar{y} - 145.75}{9.50/\sqrt{25}}$ is either (1) $\le -1.96$ or
(2) $\ge 1.96$. Here, $\bar{y} = 149.75$ and z = 2.10, so the difference
between $145.75 and $149.75 is statistically significant.


Section 6.3 


6.3.1. (a) z = 0.91, which is not larger than $z_{.05}$ (= 1.64), so $H_0$
would not be rejected. These data do not provide convincing
evidence that transmitting predator sounds helps to reduce
the number of whales in fishing waters.

(b) P-value = $P(Z \ge 0.91) = 0.1814$; $H_0$ would be rejected for
any $\alpha \ge 0.1814$.


6.3.3. $z = \dfrac{72 - 120(0.65)}{\sqrt{120(0.65)(0.35)}} = -1.15$, which is not less than
$-z_{.05}$ (= -1.64), so $H_0$: p = 0.65 would not be rejected.
6.3.5. Let $p = P(Y_i \le 0.69315)$. Test $H_0$: $p = \tfrac{1}{2}$ versus
$H_1$: $p \ne \tfrac{1}{2}$. Given that x = 26 and n = 60, the P-value = $P(X \le 26) + P(X \ge 34) = 0.3030$.
6.3.7. Reject $H_0$ if $x \ge 4$ gives $\alpha = 0.50$; reject $H_0$ if $x \ge 5$ gives
$\alpha = 0.23$; reject $H_0$ if $x \ge 6$ gives $\alpha = 0.06$; reject $H_0$ if $x \ge 7$
gives $\alpha = 0.01$.
6.3.9. (a) 0.07
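
The large-sample z ratio quoted in 6.3.3 uses x = 72, n = 120, and $p_0 = 0.65$. A minimal sketch of that computation, with an illustrative function name, is:

```python
import math

def proportion_z(x, n, p0):
    """Large-sample z ratio (x - n*p0) / sqrt(n*p0*(1 - p0))."""
    return (x - n * p0) / math.sqrt(n * p0 * (1 - p0))

z = proportion_z(72, 120, 0.65)
reject = z <= -1.64  # one-sided (left-tail) rule at alpha = 0.05
print(round(z, 2), reject)
```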


Section 6.4 


6.4.1. 0.0735 

6.4.3. 0.3786 

6.4.5. 0.6293 

6.4.7. 95 

6.4.9. 0.23 

6.4.11. $\alpha = 0.064$; $\beta = 0.107$. A Type I error (convicting an
innocent defendant) would be considered more serious than
a Type II error (acquitting a guilty defendant).

6.4.13. 1.98 

6.4.15. </0.95 


6.4.17. 1—B=(4)'" 
6.4.19. 2 
6.4.21. 0.63 
Section 6.5


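6.5.1. $\lambda = \max_{\omega}L(p)\big/\max_{\Omega}L(p)$, where $\max_{\omega}L(p) = p_0^n(1 - p_0)^{\sum_{i=1}^{n}k_i - n}$
and $\max_{\Omega}L(p) = \left(n\big/\sum_{i=1}^{n}k_i\right)^n\left[1 - \left(n\big/\sum_{i=1}^{n}k_i\right)\right]^{\sum_{i=1}^{n}k_i - n}$


6.5.3. $\lambda = \dfrac{(2\pi)^{-n/2}e^{-\frac{1}{2}\sum_{i=1}^{n}(y_i - \mu_0)^2}}{(2\pi)^{-n/2}e^{-\frac{1}{2}\sum_{i=1}^{n}(y_i - \bar{y})^2}} = e^{-\frac{1}{2}\left((\bar{y} - \mu_0)/(1/\sqrt{n})\right)^2}$.
Base the test on $z = (\bar{y} - \mu_0)/(1/\sqrt{n})$.
6.5.5. (a) $\lambda = \left(\tfrac{1}{2}\right)^n\big/\left[(k/n)^k(1 - k/n)^{n-k}\right] = 2^{-n}k^{-k}(n - k)^{k-n}n^n$.
Rejecting $H_0$ when $0 < \lambda \le \lambda^*$ is equivalent to rejecting $H_0$
when $k\ln k + (n - k)\ln(n - k) \ge \lambda^{**}$.
(b) By inspection, $k\ln k + (n - k)\ln(n - k)$ is symmetric in k.
Therefore, the left-tail and right-tail critical regions will be
equidistant from $p = \tfrac{1}{2}$, which implies that $H_0$ should be
rejected if $\left|k - \tfrac{n}{2}\right| \ge c$, where c is a function of $\alpha$.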


CHAPTER 7 


Section 7.3 


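7.3.1. Clearly, $f_U(u) > 0$ for all $u > 0$. To verify that $f_U(u)$ is a
pdf requires proving that $\int_0^{\infty}f_U(u)\,du = 1$. But $\int_0^{\infty}f_U(u)\,du = \dfrac{1}{\Gamma(n/2)}\int_0^{\infty}\dfrac{1}{2^{n/2}}\,u^{n/2-1}e^{-u/2}\,du = \dfrac{1}{\Gamma(n/2)}\int_0^{\infty}\left(\dfrac{u}{2}\right)^{n/2-1}e^{-u/2}\,(du/2) = \dfrac{1}{\Gamma(n/2)}\int_0^{\infty}v^{n/2-1}e^{-v}\,dv$, where $v = \dfrac{u}{2}$ and $dv = \dfrac{du}{2}$. By definition,
$\Gamma\left(\dfrac{n}{2}\right) = \int_0^{\infty}v^{n/2-1}e^{-v}\,dv$. Thus, $\int_0^{\infty}f_U(u)\,du = \dfrac{1}{\Gamma(n/2)}\cdot\Gamma\left(\dfrac{n}{2}\right) = 1$.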


7.3.3. If $\mu = 50$ and $\sigma = 10$, $\sum_{i=1}^{3}\left(\dfrac{Y_i - 50}{10}\right)^2$ should have a
$\chi_3^2$ distribution, implying that the numerical value of
the sum is likely to be between, say, $\chi^2_{.025,3}$ (= 0.216) and
$\chi^2_{.975,3}$ (= 9.348). Here, $\sum_{i=1}^{3}\left(\dfrac{y_i - 50}{10}\right)^2 = 6.50$, so the data are not inconsistent with the
hypothesis that the $Y_i$'s are normally distributed with $\mu = 50$
and $\sigma = 10$.


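7.3.5. Since $E(S^2) = \sigma^2$, it follows from Chebyshev's inequality
that $P(|S^2 - \sigma^2| < \varepsilon) > 1 - \dfrac{\text{Var}(S^2)}{\varepsilon^2}$. But $\text{Var}(S^2) = \dfrac{2\sigma^4}{n-1} \to 0$ as
$n \to \infty$. Therefore, $S^2$ is consistent for $\sigma^2$.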


7.3.7. (a) 0.983 
(b) 0.132 
(c) 9.00 
7.3.9. (a) 6.23 
(b) 0.65 
(c) 9 
(d) 15 
(e) 2.28 
7.3.11. $F = \dfrac{V/m}{U/n}$, where V and U are independent $\chi^2$ variables
with m and n degrees of freedom, respectively. Then
$\dfrac{1}{F} = \dfrac{U/n}{V/m}$, which implies that $\dfrac{1}{F}$ has an F distribution with n
and m degrees of freedom.


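7.3.15. Let T be a Student t random variable with n degrees
of freedom. Then $E(T^{2k}) = C\int_{-\infty}^{\infty}\dfrac{t^{2k}}{\left(1 + \frac{t^2}{n}\right)^{(n+1)/2}}\,dt$, where
C is the product of the constants appearing in the definition
of the Student t pdf. The change of variable $y = t/\sqrt{n}$
results in the integral $E(T^{2k}) = C^*\int_{-\infty}^{\infty}\dfrac{y^{2k}}{(1 + y^2)^{(n+1)/2}}\,dy$ for
some constant $C^*$. Because of the symmetry of the integrand,
$E(T^{2k})$ is finite if the integral $\int_0^{\infty}\dfrac{y^{2k}}{(1 + y^2)^{(n+1)/2}}\,dy$ is finite. But

$\int_0^{\infty}\dfrac{y^{2k}}{(1 + y^2)^{(n+1)/2}}\,dy < \int_0^{\infty}\dfrac{(1 + y^2)^k}{(1 + y^2)^{(n+1)/2}}\,dy = \int_0^{\infty}\dfrac{1}{(1 + y^2)^{(n+1)/2 - k}}\,dy = \int_0^{\infty}\dfrac{1}{(1 + y^2)^{\frac{n-2k}{2} + \frac{1}{2}}}\,dy$

To apply the hint, take $\alpha = 2$ and $\beta = \dfrac{n - 2k}{2} + \dfrac{1}{2}$. Then
$2k < n$, $\beta > 0$, and $\alpha\beta > 1$, so the integral is finite.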
Section 7.4 
7.4.1. (a) 0.15
(b) 0.80 
(c) 0.85 


(d) 0.99 —0.15 =0.84 

7.4.3. Both differences represent intervals associated with
5% of the area under $f_{T_n}(t)$. Because the pdf is closer to the
horizontal axis the further t is away from 0, the difference
$t_{.05,n} - t_{.10,n}$ is the larger of the two.

7.4.5. k=2.2281 

7.4.7. (0.869, 1.153) 

7.4.9. (a) (30.8 yrs, 40.0 yrs) 

(b) The graph of date versus age shows no obvious patterns
or trends. The assumption that $\mu$ has remained constant over
time is believable.

7.4.11. (175.6, 211.4) 

The medical and statistical definition of “normal” differ 
somewhat. There are people with medically normal platelet 
counts who appear in the population less than 10% of the 
time. 

7.4.13. No, because the length of a confidence interval for
$\mu$ is a function of s as well as the confidence coefficient. If
the sample standard deviation for the second sample was 
sufficiently small (relative to the sample standard deviation 
for the first sample), the 95% confidence interval would be 
shorter than the 90% confidence interval. 

7.4.15. (a) 0.95 

(b) 0.80 

(c) 0.945 

(d) 0.95 

7.4.17. Obs. t = -1.71; $-t_{.05,18} = -1.7341$; fail to reject $H_0$.
7.4.19. Test $H_0$: $\mu = 40$ vs. $H_1$: $\mu < 40$; obs. t = -2.25;
$-t_{.05,14} = -1.7613$; reject $H_0$.

7.4.21. Test $H_0$: $\mu = 0.0042$ vs. $H_1$: $\mu < 0.0042$; obs. t =
-2.48; $-t_{.05,9} = -1.8331$; reject $H_0$.


7.4.23. Because of the skewed shape of $f_Y(y)$, and if the
sample size were small, it would not be unusual for all the
$y_i$'s to lie close together near 0. When that happens, $\bar{y}$ will be
less than $\mu$, s will be considerably smaller than E(S), and the t
ratio will be further to the left of 0 than $f_{T_{n-1}}(t)$ would predict.


7.4.25. f2(z) 


Section 7.5 


7.5.1. (a) 23.685 
(b) 4.605 
(c) 2.700 
7.5.3. (a) 2.088 
(b) 7.261 
(c) 14.041 
(d) 17.539


7.5.5. 233.9 
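7.5.7. $P\left(\chi^2_{\alpha/2,n-1} \le \dfrac{(n-1)S^2}{\sigma^2} \le \chi^2_{1-\alpha/2,n-1}\right) = 1 - \alpha = P\left(\dfrac{(n-1)S^2}{\chi^2_{1-\alpha/2,n-1}} \le \sigma^2 \le \dfrac{(n-1)S^2}{\chi^2_{\alpha/2,n-1}}\right)$, so $\left(\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}}, \dfrac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}}\right)$ is a
$100(1 - \alpha)\%$ confidence interval for $\sigma^2$. Taking the square
root of both sides gives a $100(1 - \alpha)\%$ confidence interval
for $\sigma$.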

7.5.9. (a) (20.13, 42.17) 

(b) (0, 39.16) and (21.11, oo) 

7.5.11. Confidence intervals for $\sigma$ (as opposed to $\sigma^2$) are
often preferred by experimenters because they are expressed
in the same units as the data, which makes them easier to
interpret.


ec (ne as _ = 
= Xian = l-as= 


= 261.92, and s =9.8. 


re 305 


7.5.15. Test $H_0$: $\sigma^2 = 30.4^2$ versus $H_1$: $\sigma^2 < 30.4^2$. The test
statistic in this case is $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2} = 14.285$.

The critical value is $\chi^2_{\alpha,n-1} = \chi^2_{.05,18} = 9.390$. Accept the
null hypothesis, so do not assume that the potassium-argon
method is more precise.


7.5.17. (a) Test $H_0$: $\mu = 10.1$ versus $H_1$: $\mu > 10.1$
Test statistic is $t = \dfrac{\bar{y} - \mu_0}{s/\sqrt{n}} = 0.674$. Critical value is
$t_{\alpha,n-1} = t_{.05,23} = 1.7139$.

Accept the null hypothesis. Do not ascribe the increase
of the portfolio yield over the benchmark to the analyst's
system for choosing stocks.

(b) Test $H_0$: $\sigma^2 = 15.67$ versus $H_1$: $\sigma^2 < 15.67$

Test statistic is $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2} = 9.688$. Critical value is $\chi^2_{.05,23} = 13.091$.

Reject the null hypothesis. The analyst's method of choosing
stocks does seem to result in less volatility.


CHAPTER 8 


Section 8.2 


8.2.1. Regression data 
8.2.3. One-sample data 
8.2.5. Regression data 
8.2.7. k-sample data
8.2.9. One-sample data 
8.2.11. Regression data 
8.2.13. Two-sample data 
8.2.15. k-sample data 
8.2.17. Categorical data 
8.2.19. Two-sample data 
8.2.21. Paired data 
8.2.23. Categorical data 
8.2.25. Categorical data 
8.2.27. Categorical data 
8.2.29. Paired data 
8.2.31. Randomized block data 


CHAPTER 9 


Section 9.2 


9.2.1. Since $t = 1.72 < t_{.01,19} = 2.539$, accept $H_0$.
9.2.3. Since $z_{.05} = 1.64 < t = 5.67$, reject $H_0$.
9.2.5. Since $-z_{.005} = -2.58 < t = -0.532 < z_{.005} = 2.58$, do not
reject $H_0$.
9.2.7. Since $-t_{.025,6} = -2.4469 < t = 0.69 < t_{.025,6} = 2.4469$,
accept $H_0$.
9.2.9. Since $t = 2.16 > t_{.025,96} = 1.9880$, reject $H_0$.
9.2.11. (a) 22.880 (b) 166.990
9.2.13. (a) 0.3974 (b) 0.2090
9.2.15. $E(S_X^2) = E(S_Y^2) = \sigma^2$ by Example 5.4.4.

$E(S_p^2) = \dfrac{(n-1)E(S_X^2) + (m-1)E(S_Y^2)}{n + m - 2} = \dfrac{(n-1)\sigma^2 + (m-1)\sigma^2}{n + m - 2} = \sigma^2$


9.2.17. Since $t = 2.16 > t_{.05,13} = 1.7709$, reject $H_0$.

9.2.19. (a) The sample standard deviation for the first data
set is approximately 3.15; for the second, 3.29. These seem
close enough to permit the use of Theorem 9.2.2.

(b) Intuitively, the states with the comprehensive law should
have fewer deaths. However, the average for these data is
8.1, which is larger than the average of 7.0 for the states with
a more limited law.


Section 9.3 


9.3.1. The observed F = 35.7604/115.9929 = 0.308. Since
$F_{.025,11,11} = 0.288 < 0.308 < 3.47 = F_{.975,11,11}$, we can accept $H_0$
that the variances are equal.


9.3.3. (a) The critical values are $F_{.025,19,19}$ and $F_{.975,19,19}$. These
values are not tabulated, but in this case, we can approximate
them by $F_{.025,20,20} = 0.406$ and $F_{.975,20,20} = 2.46$. The observed
F = 2.41/3.52 = 0.685. Since 0.406 < 0.685 < 2.46, we can
accept $H_0$ that the variances are equal.


(b) Since $t = 2.662 > t_{.025,38} = 2.0244$, reject $H_0$.


9.3.5. $F = (0.20)^2/(0.37)^2 = 0.292$. Since $0.248 = F_{.025,9,9} < 0.292 < 4.03 = F_{.975,9,9}$, accept $H_0$.
9.3.7. F = 65.25/227.77 = 0.286. Since $0.208 = F_{.025,8,5} < 0.286 < 6.76 = F_{.975,8,5}$, accept $H_0$. Thus, Theorem 9.2.2 is
appropriate.

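9.3.9. If $\sigma_X^2 = \sigma_Y^2 = \sigma^2$, the maximum likelihood estimator for
$\sigma^2$ is

$\hat{\sigma}^2 = \dfrac{1}{n + m}\left(\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{i=1}^{m}(y_i - \bar{y})^2\right)$.

Then $L(\hat{\omega}) = \left(\dfrac{1}{2\pi\hat{\sigma}^2}\right)^{(n+m)/2}e^{-\frac{1}{2\hat{\sigma}^2}\left(\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{i=1}^{m}(y_i - \bar{y})^2\right)} = \left(\dfrac{1}{2\pi\hat{\sigma}^2}\right)^{(n+m)/2}e^{-(n+m)/2}$


If $\sigma_X^2 \ne \sigma_Y^2$, the maximum likelihood estimators for $\sigma_X^2$ and
$\sigma_Y^2$ are

$\hat{\sigma}_X^2 = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, and $\hat{\sigma}_Y^2 = \dfrac{1}{m}\sum_{i=1}^{m}(y_i - \bar{y})^2$.

Then $L(\hat{\Omega}) = \left(\dfrac{1}{2\pi\hat{\sigma}_X^2}\right)^{n/2}e^{-\frac{1}{2\hat{\sigma}_X^2}\sum_{i=1}^{n}(x_i - \bar{x})^2}\left(\dfrac{1}{2\pi\hat{\sigma}_Y^2}\right)^{m/2}e^{-\frac{1}{2\hat{\sigma}_Y^2}\sum_{i=1}^{m}(y_i - \bar{y})^2} = \left(\dfrac{1}{2\pi\hat{\sigma}_X^2}\right)^{n/2}e^{-n/2}\left(\dfrac{1}{2\pi\hat{\sigma}_Y^2}\right)^{m/2}e^{-m/2}$


The ratio $\lambda = \dfrac{L(\hat{\omega})}{L(\hat{\Omega})} = \dfrac{(\hat{\sigma}_X^2)^{n/2}(\hat{\sigma}_Y^2)^{m/2}}{(\hat{\sigma}^2)^{(n+m)/2}}$, which equates to the
expression given in the statement of the question.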

Section 9.4 


9.4.1. Since $-1.96 < z = 1.76 < 1.96 = z_{.025}$, accept $H_0$.
9.4.3. Since $-1.96 < z = -0.17 < 1.96 = z_{.025}$, accept $H_0$ at the
0.05 level of significance.


9.4.5. Since $z = 4.25 > 2.33 = z_{.01}$, reject $H_0$ at the 0.01 level
of significance.

9.4.7. Since $-1.96 < z = 1.50 < 1.96 = z_{.025}$, accept $H_0$ at the
0.05 level of significance.

9.4.9. Since $z = 0.25 < 1.64 = z_{.05}$, accept $H_0$. The player is right.


Section 9.5 


9.5.1. (0.71, 1.55). Since 0 is not in the interval, we can reject
the null hypothesis that $\mu_X = \mu_Y$.

9.5.3. Equal variance confidence interval is (—13.32, 
6.72). Unequal variance confidence interval is (—13.61, 
7.01). 

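9.5.5. Begin with the statistic $\bar{X} - \bar{Y}$, which has $E(\bar{X} - \bar{Y}) = \mu_X - \mu_Y$ and $\text{Var}(\bar{X} - \bar{Y}) = \sigma_X^2/n + \sigma_Y^2/m$. Then

$P\left(-z_{\alpha/2} \le \dfrac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}} \le z_{\alpha/2}\right) = 1 - \alpha$, which implies

$P\left(-z_{\alpha/2}\sqrt{\sigma_X^2/n + \sigma_Y^2/m} \le \bar{X} - \bar{Y} - (\mu_X - \mu_Y) \le z_{\alpha/2}\sqrt{\sigma_X^2/n + \sigma_Y^2/m}\right) = 1 - \alpha$.


Solving the inequality for $\mu_X - \mu_Y$ gives

$P\left(\bar{X} - \bar{Y} - z_{\alpha/2}\sqrt{\sigma_X^2/n + \sigma_Y^2/m} \le \mu_X - \mu_Y \le \bar{X} - \bar{Y} + z_{\alpha/2}\sqrt{\sigma_X^2/n + \sigma_Y^2/m}\right) = 1 - \alpha$.


Thus the confidence interval is

$\left(\bar{x} - \bar{y} - z_{\alpha/2}\sqrt{\sigma_X^2/n + \sigma_Y^2/m},\ \bar{x} - \bar{y} + z_{\alpha/2}\sqrt{\sigma_X^2/n + \sigma_Y^2/m}\right)$.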
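
The interval derived in 9.5.5 translates directly into a short function. The sketch below assumes the variances are known; the numerical inputs are hypothetical and chosen only to exercise the formula.

```python
import math

def diff_means_ci(xbar, ybar, var_x, var_y, n, m, z=1.96):
    """Confidence interval for mu_X - mu_Y with known variances.

    z is the critical value z_{alpha/2}; the default 1.96 gives a 95% interval.
    """
    half_width = z * math.sqrt(var_x / n + var_y / m)
    diff = xbar - ybar
    return diff - half_width, diff + half_width

# Hypothetical summary statistics, not taken from any question in the text.
lower, upper = diff_means_ci(xbar=10.0, ybar=8.0, var_x=4.0, var_y=9.0, n=16, m=36)
print(round(lower, 3), round(upper, 3))
```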


9.5.7. (0.06, 2.14). Since the confidence interval contains 
1, we can accept Hy that the variances are equal, and 
Theorem 9.2.1 applies. 

9.5.9. (—0.021, 0.051). Since the confidence interval contains 
0, we can conclude that Flonase users do not suffer more 
headaches. 

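9.5.11. The approximate normal distribution implies that

$P\left(-z_{\alpha/2} \le \dfrac{\frac{X}{n} - \frac{Y}{m} - (p_X - p_Y)}{\sqrt{\frac{(X/n)(1 - X/n)}{n} + \frac{(Y/m)(1 - Y/m)}{m}}} \le z_{\alpha/2}\right) = 1 - \alpha$

or

$P\left(-z_{\alpha/2}\sqrt{\tfrac{(X/n)(1 - X/n)}{n} + \tfrac{(Y/m)(1 - Y/m)}{m}} \le \dfrac{X}{n} - \dfrac{Y}{m} - (p_X - p_Y) \le z_{\alpha/2}\sqrt{\tfrac{(X/n)(1 - X/n)}{n} + \tfrac{(Y/m)(1 - Y/m)}{m}}\right) = 1 - \alpha$


which implies that

$P\left(-\left(\dfrac{X}{n} - \dfrac{Y}{m}\right) - z_{\alpha/2}\sqrt{\tfrac{(X/n)(1 - X/n)}{n} + \tfrac{(Y/m)(1 - Y/m)}{m}} \le -(p_X - p_Y) \le -\left(\dfrac{X}{n} - \dfrac{Y}{m}\right) + z_{\alpha/2}\sqrt{\tfrac{(X/n)(1 - X/n)}{n} + \tfrac{(Y/m)(1 - Y/m)}{m}}\right) = 1 - \alpha$


Multiplying the inequality by -1 yields the confidence
interval.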


CHAPTER 10 


Section 10.2 


10.2.1. 0.000886 
10.2.3. 0.00265 
10.2.5. 0.00649 
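10.2.7. (a) $\dfrac{50!}{3!\,7!\,15!\,25!}\left(\dfrac{1}{64}\right)^3\left(\dfrac{7}{64}\right)^7\left(\dfrac{19}{64}\right)^{15}\left(\dfrac{37}{64}\right)^{25}$
(b) $\text{Var}(X_3) = 50\left(\dfrac{19}{64}\right)\left(\dfrac{45}{64}\right) = 10.44$
10.2.9. Assume that $M_{X_1,X_2,X_3}(t_1, t_2, t_3) = (p_1e^{t_1} + p_2e^{t_2} + p_3e^{t_3})^n$. Then $M_{X_1,X_2,X_3}(t_1, 0, 0) = E(e^{t_1X_1}) = (p_1e^{t_1} + p_2 + p_3)^n = (1 - p_1 + p_1e^{t_1})^n$ is the mgf for $X_1$. But the latter has the form
of the mgf for a binomial random variable with parameters n
and $p_1$.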


Section 10.3 


10.3.1. $\sum_{i=1}^{t}\dfrac{(X_i - np_i)^2}{np_i} = \sum_{i=1}^{t}\dfrac{X_i^2 - 2np_iX_i + n^2p_i^2}{np_i} = \sum_{i=1}^{t}\dfrac{X_i^2}{np_i} - 2\sum_{i=1}^{t}X_i + n\sum_{i=1}^{t}p_i = \sum_{i=1}^{t}\dfrac{X_i^2}{np_i} - n$.

10.3.3. If the sampling is done with replacement, the number
of white chips drawn should follow a binomial distribution
(with n = 2 and p = 0.4). Since the obs. $\chi^2 = 3.30 < 4.605 = \chi^2_{.90,2}$, fail to reject $H_0$.

10.3.5. Let p = P(baby is born between midnight and
4 A.M.). Test $H_0$: p = 1/6 vs. $H_1$: $p \ne 1/6$; obs. z = 2.73;
reject $H_0$ if $\alpha = 0.05$. The obs. $\chi^2$ in Question 10.3.4
will equal the square of the obs. z. The two tests are
equivalent.

10.3.7. Obs. $\chi^2 = 12.23$ with 5 df; $\chi^2_{.95,5} = 11.070$; reject $H_0$.
10.3.9. Obs. $\chi^2 = 18.22$ with 7 df; $\chi^2_{.95,7} = 14.067$; reject $H_0$.
10.3.11. Obs. $\chi^2 = 8.10$; $\chi^2_{.95,1} = 3.841$; reject $H_0$.
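
The algebraic shortcut in 10.3.1 is easy to confirm numerically. In the sketch below the counts and cell probabilities are hypothetical; both forms of the statistic should return the same value.

```python
def pearson_direct(counts, probs):
    """Sum of (X_i - n p_i)^2 / (n p_i) over all cells."""
    n = sum(counts)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, probs))

def pearson_shortcut(counts, probs):
    """Equivalent form: sum of X_i^2 / (n p_i), minus n."""
    n = sum(counts)
    return sum(x * x / (n * p) for x, p in zip(counts, probs)) - n

counts = [18, 30, 52]    # hypothetical observed counts
probs = [0.2, 0.3, 0.5]  # hypothetical null probabilities
print(pearson_direct(counts, probs), pearson_shortcut(counts, probs))
```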


Section 10.4 


10.4.1. Obs. $\chi^2 = 11.72$ with 4 - 1 - 1 = 2 df; $\chi^2_{.95,2} = 5.991$;
reject $H_0$.

10.4.3. Obs. $\chi^2 = 46.75$ with 7 - 1 - 1 = 5 df; $\chi^2_{.95,5} = 11.070$;
reject $H_0$. The independence assumption would not hold if
the infestation was contagious.

10.4.5. For the model $f_Y(y) = \lambda e^{-\lambda y}$, $\hat{\lambda} = 0.823$; obs. $\chi^2 = 4.181$
with 5 - 1 - 1 = 3 df; $\chi^2_{.95,3} = 7.815$; fail to reject $H_0$.

10.4.7. Let p = P(child is a boy). Then $\hat{p} = 0.533$, obs.
$\chi^2 = 0.62$, and we fail to reject the binomial model because
$\chi^2_{.95,1} = 3.841$.

10.4.9. For the model $p_X(k) = e^{-3.87}(3.87)^k/k!$, obs. $\chi^2 = 12.9$
with 12 - 1 - 1 = 10 df. But $\chi^2_{.95,10} = 18.307$, so we fail to
reject $H_0$.

10.4.11. $\hat{p} = 0.26$; obs. $\chi^2 = 9.23$; $\chi^2_{.95,3} = 7.815$; reject $H_0$.


Section 10.5 

10.5.1. Obs. $\chi^2 = 2.77$; $\chi^2_{.90,1} = 2.706$ and $\chi^2_{.95,1} = 3.841$, so $H_0$
is rejected at the $\alpha = 0.10$ level but not at the $\alpha = 0.05$ level.
10.5.3. Obs. $\chi^2 = 42.25$; $\chi^2_{.99,3} = 11.345$; reject $H_0$.

10.5.5. Obs. $\chi^2 = 4.80$; $\chi^2_{.95,1} = 3.841$; reject $H_0$. Regular use
of aspirin appears to lessen the chances that a woman will
develop breast cancer.

10.5.7. Obs. $\chi^2 = 12.61$; $\chi^2_{.95,1} = 3.841$; reject $H_0$.

10.5.9. Obs. $\chi^2 = 2.197$; $\chi^2_{.95,1} = 3.841$; fail to reject $H_0$.


CHAPTER 11 
Section 11.2 


11.2.1. y= 25.23 + 3.29x; 84.5°F 
11.2.3.
$x_i$    $y_i - \hat{y}_i$
0 -0.81
4 0.01 
10 0.09 
15 0.03 
21 —0.09 
29 0.14 
36 0.55 
51 1.69 
68 -1.61 
A straight line appears to fit these data. 
11.2.5. The value 12 is too “far” from the data observed. 
11.2.7. The least squares line is y = 88.1+0.412x. 
[Figure: scatterplot; horizontal axis: Spending]
11.2.9. The least squares line is y = 114.72+9.23x. 


A linear fit seems reasonable. 


11.2.11. The least squares line is y = 0.61 + 0.84x, which 
seems inadequate because of the large values of the residuals. 


11.2.13. When $\bar{x}$ is substituted for x in the least squares
equation, we obtain $y = a + b\bar{x} = \bar{y} - b\bar{x} + b\bar{x} = \bar{y}$.

11.2.15. 0.03544 

11.2.17. y= 100—5.19x 


11.2.19. To find the a, b, and c, solve the following set of 
equations. 
(1) nat+ (o») b+ (>> vs) c= yoy: 
i=l i=l i=l 
(2) (3s)o+ (Sox) e+ (> sins Je= Do 
i=1 i=1 i=l i=1 
(3) (> css) at+ (> Xj css) b+ 
i=1 i=1 
(> (cos x;) (sin ») c= > y; COS X; 


i=1 i=1 


11.2.21. (a) y=4.6791e°"*** (b) 8.362 trillion 
(c) Part (b) and the residual pattern cast doubt on the 
exponential model. 

[Figure: residual plot; vertical axis: Residuals; horizontal axis: Years after 1995]

11.2.23. y=819.4e°!8 
11.2.25. The model is y = 13.487x'°*8. 
11.2.27. The model is y = 0.07416x!8%’. 


[Figure: scatterplot; axis label: Striate Cortex]


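11.2.29. (d) If $y = \dfrac{1}{a + bx}$, then $\dfrac{1}{y} = a + bx$, and $1/y$ is linear
with $x$.

(e) If $y = \dfrac{x}{a + bx}$, then $\dfrac{1}{y} = b + a\left(\dfrac{1}{x}\right)$, and $1/y$ is linear
with $1/x$.

(f) If $y = 1 - e^{-x^b/a}$, then $1 - y = e^{-x^b/a}$, and $\dfrac{1}{1-y} = e^{x^b/a}$. Taking
the ln of both sides gives $\ln \dfrac{1}{1-y} = x^b/a$. Taking the ln again
yields $\ln \ln \dfrac{1}{1-y} = -\ln a + b \ln x$, and $\ln \ln \dfrac{1}{1-y}$ is linear with $\ln x$.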


11.2.31. $a = 5.55870$; $b = -0.13314$

[Figure: plot of % Improved against Age Mid-Point]


Section 11.3 

11.3.1. $y = 13.8 - 1.5x$; since $-t_{.025,2} = -4.3027 < t = -1.59 < 4.3027 = t_{.025,2}$,
accept $H_0$.

11.3.3. Since $t = 5.47 > t_{.005,13} = 3.0123$, reject $H_0$.

11.3.5. 0.9164

11.3.7. (66.551, 68.465)

11.3.9. Since $t = 4.38 > t_{.025,9} = 2.2622$, reject $H_0$.

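11.3.11. By Theorem 11.3.2, $E(\hat{\beta}_0) = \beta_0$, and

$$\mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

Now $(\hat{\beta}_0 - \beta_0)/\sqrt{\mathrm{Var}(\hat{\beta}_0)}$ is standard normal, so

$$P\left(-z_{\alpha/2} \le (\hat{\beta}_0 - \beta_0)/\sqrt{\mathrm{Var}(\hat{\beta}_0)} \le z_{\alpha/2}\right) = 1 - \alpha.$$

Then the confidence interval is

$$\left(\hat{\beta}_0 - z_{\alpha/2}\sqrt{\mathrm{Var}(\hat{\beta}_0)},\ \hat{\beta}_0 + z_{\alpha/2}\sqrt{\mathrm{Var}(\hat{\beta}_0)}\right)$$

or

$$\left(\hat{\beta}_0 - z_{\alpha/2}\,\frac{\sigma\sqrt{\sum_{i=1}^{n} x_i^2}}{\sqrt{n \sum_{i=1}^{n} (x_i - \bar{x})^2}},\ \hat{\beta}_0 + z_{\alpha/2}\,\frac{\sigma\sqrt{\sum_{i=1}^{n} x_i^2}}{\sqrt{n \sum_{i=1}^{n} (x_i - \bar{x})^2}}\right).$$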


11.3.13. Reject the null hypothesis if the test statistic is
$< \chi^2_{.025,22} = 10.982$ or $> \chi^2_{.975,22} = 36.781$. The observed chi
square is $\dfrac{(n-2)s^2}{\sigma_0^2} = \dfrac{(24-2)(18.2)}{12.6} = 31.778$, so do not
reject $H_0$.
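
The arithmetic in 11.3.13 can be reproduced with a few lines of Python. This is only a sketch: the function name `variance_test_statistic` is an illustrative assumption, and the two critical values are copied from the answer above rather than computed.

```python
def variance_test_statistic(n, s_squared, sigma0_squared):
    """Chi square statistic (n - 2) * s^2 / sigma_0^2 for the regression variance."""
    return (n - 2) * s_squared / sigma0_squared


chi_sq = variance_test_statistic(n=24, s_squared=18.2, sigma0_squared=12.6)

# Critical values quoted in the answer (chi square with 22 df).
lower, upper = 10.982, 36.781
reject = chi_sq < lower or chi_sq > upper
```

Here `chi_sq` rounds to 31.778, which lies between the two critical values, so `reject` is `False`.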
11.3.15. (2.655, 17.237) 
11.3.17. (2.060, 2.087) 


11.3.19. The confidence interval of (173.89, 214.13) does not 
contain the Harvard median salary. The prediction interval 
of (147.40, 240.62) does. 


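11.3.21. The test statistic is

$$t = \frac{\hat{\beta}_1 - \hat{\beta}_1^*}{s\sqrt{\dfrac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2} + \dfrac{1}{\sum_{i=1}^{m}(x_i^* - \bar{x}^*)^2}}},$$

where $s$ is the pooled estimate of $\sigma$ based on $6 + 8 - 4$ degrees of freedom; here $s = 1.407$.

Then $t = \dfrac{\hat{\beta}_1 - 1.07}{1.407\sqrt{\dfrac{1}{31.33} + \dfrac{1}{46}}} = -1.42$.

Since the observed ratio is not less than $-t_{.05,10} = -1.8125$,
the difference in slopes can be ascribed to chance. These data
do not support further investigation.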


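11.3.23. The form given in the text is

$$\mathrm{Var}(\hat{Y}) = \sigma^2\left[\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right].$$

Putting the sum in the brackets over a least common denominator gives

$$\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(x - \bar{x})^2}{n\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$= \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2 + n(x^2 + \bar{x}^2 - 2x\bar{x})}{n\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$= \frac{\sum_{i=1}^{n} x_i^2 + nx^2 - 2nx\bar{x}}{n\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$= \frac{\sum_{i=1}^{n} x_i^2 + nx^2 - 2x\sum_{i=1}^{n} x_i}{n\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - x)^2}{n\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

Thus, $\mathrm{Var}(\hat{Y}) = \dfrac{\sigma^2\sum_{i=1}^{n}(x_i - x)^2}{n\sum_{i=1}^{n}(x_i - \bar{x})^2}$.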

Section 11.4 
11.4.1. $-2/121$; $-2/(15\sqrt{14})$


11.4.3. 0.492 


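11.4.5. $\rho(a + bX, c + dY) = \dfrac{\mathrm{Cov}(a + bX, c + dY)}{\sqrt{\mathrm{Var}(a + bX)\,\mathrm{Var}(c + dY)}} = \dfrac{bd\,\mathrm{Cov}(X, Y)}{\sqrt{b^2\,\mathrm{Var}(X)\,d^2\,\mathrm{Var}(Y)}}$,
the equality in the numerators stemming from Question 3.9.14. Since $b > 0$, $d > 0$, this last
expression is

$$\frac{bd\,\mathrm{Cov}(X, Y)}{bd\,\sigma_X \sigma_Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \rho(X, Y).$$

11.4.7. (a) $\mathrm{Cov}(X + Y, X - Y) = E[(X + Y)(X - Y)] - E(X + Y)E(X - Y)$
$= E[X^2 - Y^2] - (\mu_X + \mu_Y)(\mu_X - \mu_Y)$
$= E(X^2) - \mu_X^2 - E(Y^2) + \mu_Y^2$
$= \mathrm{Var}(X) - \mathrm{Var}(Y)$

(b) $\rho(X + Y, X - Y) = \dfrac{\mathrm{Cov}(X + Y, X - Y)}{\sqrt{\mathrm{Var}(X + Y)\,\mathrm{Var}(X - Y)}}$. By Part (a),
$\mathrm{Cov}(X + Y, X - Y) = \mathrm{Var}(X) - \mathrm{Var}(Y)$.
$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 0$.
Similarly, $\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$. Then

$$\rho(X + Y, X - Y) = \frac{\mathrm{Var}(X) - \mathrm{Var}(Y)}{\sqrt{(\mathrm{Var}(X) + \mathrm{Var}(Y))(\mathrm{Var}(X) + \mathrm{Var}(Y))}} = \frac{\mathrm{Var}(X) - \mathrm{Var}(Y)}{\mathrm{Var}(X) + \mathrm{Var}(Y)}.$$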
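
Part (a) of 11.4.7 is an algebraic identity, so it also holds when expected values are replaced by sample averages (no independence is needed for Part (a)). The sketch below checks this in Python; the helper `cov` and the two data lists are illustrative assumptions, not values from the text.

```python
from statistics import mean


def cov(us, vs):
    """Average product of deviations (divisor n), the sample analogue of Cov(U, V)."""
    u_bar, v_bar = mean(us), mean(vs)
    return sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs)) / len(us)


# Hypothetical paired observations, for illustration only.
xs = [2.0, 4.0, 5.0, 7.0, 11.0]
ys = [1.0, 3.0, 2.0, 8.0, 6.0]

sums = [x + y for x, y in zip(xs, ys)]
diffs = [x - y for x, y in zip(xs, ys)]

lhs = cov(sums, diffs)            # Cov(X + Y, X - Y)
rhs = cov(xs, xs) - cov(ys, ys)   # Var(X) - Var(Y)
```

The two quantities `lhs` and `rhs` agree up to floating-point error because the cross terms $-\mathrm{Cov}(X, Y)$ and $+\mathrm{Cov}(Y, X)$ cancel.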


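11.4.9. By Equation 11.4.2,

$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\ \sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\cdots}$$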


11.4.11. $r = -0.030$. The data do not suggest that altitude
affects home run hitting. 
11.4.13. 58.1% 


Section 11.5 


11.5.1. 0.1891; 0.2127 


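11.5.3. (a)

$$f_{X+Y}(t) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \int_{-\infty}^{\infty} \exp\left\{-\frac{1}{2}\left(\frac{1}{1 - \rho^2}\right)\left[(t - y)^2 - 2\rho(t - y)y + y^2\right]\right\} dy$$

The expression in the brackets can be expanded and
rewritten as

$$t^2 + 2(1 + \rho)y^2 - 2t(1 + \rho)y = t^2 + 2(1 + \rho)\left[y^2 - ty\right]$$

$$= t^2 + 2(1 + \rho)\left[y^2 - ty + \tfrac{1}{4}t^2\right] - \tfrac{1}{2}(1 + \rho)t^2$$

$$= \frac{1 - \rho}{2}\,t^2 + 2(1 + \rho)(y - t/2)^2.$$

Placing this expression into the exponent gives

$$f_{X+Y}(t) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\ e^{-\frac{1}{2}\frac{t^2}{2(1 + \rho)}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\frac{(y - t/2)^2}{(1 - \rho)/2}}\, dy$$

The integral is that of a normal pdf with mean $t/2$ and
$\sigma^2 = (1 - \rho)/2$. Thus, the integral equals $\sqrt{2\pi(1 - \rho)/2} = \sqrt{\pi(1 - \rho)}$.
Putting this into the expression for $f_{X+Y}$ gives

$$f_{X+Y}(t) = \frac{1}{\sqrt{2\pi}\sqrt{2(1 + \rho)}}\ e^{-\frac{1}{2}\frac{t^2}{2(1 + \rho)}},$$

which is the pdf of a normal variable with $\mu = 0$ and $\sigma^2 = 2(1 + \rho)$.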

(b) $E(cX + dY) = c\mu_X + d\mu_Y$, $\mathrm{Var}(cX + dY) = c^2\sigma_X^2 + d^2\sigma_Y^2 + 2cd\,\sigma_X\sigma_Y\,\rho(X, Y)$

11.5.5. $E(X) = E(Y) = 0$; $\mathrm{Var}(X) = 4$; $\mathrm{Var}(Y) = 1$; $\rho(X, Y) = 1/2$; $k = 1/(2\pi\sqrt{3})$

11.5.7. Since $-t_{.005,18} = -2.8784 < T_{n-2} = -2.156 < 2.8784 = t_{.005,18}$, accept $H_0$.

11.5.9. Since $-t_{.025,10} = -2.2281 < T_{n-2} = -0.094 < 2.2281 = t_{.025,10}$, accept $H_0$.

11.5.11. $r = 0.249$. $T_8 = \dfrac{\sqrt{8}\,(0.249)}{\sqrt{1 - (0.249)^2}} = 0.73$.
Since $T_8 = 0.73 < 1.397 = t_{.10,8}$, accept $H_0$.
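
The statistic in 11.5.11 has the form $T_{n-2} = \sqrt{n-2}\,r/\sqrt{1 - r^2}$. A minimal Python sketch of that computation follows; the function name is an illustrative assumption, and the sample size $n = 10$ is inferred from the subscript in $T_8$ rather than stated in the answer.

```python
import math


def correlation_t_statistic(r, n):
    """T_{n-2} = sqrt(n - 2) * r / sqrt(1 - r^2), used to test H0: rho = 0."""
    return math.sqrt(n - 2) * r / math.sqrt(1 - r ** 2)


# n = 10 is an assumption inferred from the 8 degrees of freedom in the answer.
t_obs = correlation_t_statistic(r=0.249, n=10)

critical_value = 1.397  # t_{.10,8}, as quoted in the answer
accept_h0 = t_obs < critical_value
```

Rounded to two decimals, `t_obs` is 0.73, in agreement with the value reported above.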
CHAPTER 12 
Section 12.2 


12.2.1. Obs. $F = 3.94$ with 3 and 6 df; $F_{.95,3,6} = 4.76$ and
$F_{.90,3,6} = 3.29$, so $H_0$ would be rejected at the $\alpha = 0.10$ level,
but not at the $\alpha = 0.05$ level.


12.2.3. 


Source   df   SS      MS     F
Sector    2   186.0   93.0   3.44
Error    27   728.2   27.0
Total    29   914.2

$F_{.99,2,27}$ does not appear in Table A.9, but $F_{.99,2,30} = 5.39 <
F_{.99,2,27} < F_{.99,2,24} = 5.61$. Thus, we fail to reject $H_0$, since
$3.44 < 5.39$.



12.2.5. 
Source df SS MS F P 
Tribe 3 504167 168056 3.70 0.062 
Error 8 363333 45417 
Total 11 867500 


Since the P-value is greater than 0.01, we fail to reject $H_0$.


12.2.7. 


Source df SS MS F 
Treatment 4 271.36 67.84 6.40 
Error 10 106.00 10.60 

Total 14 377.36 
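
Each mean square in an ANOVA table is a sum of squares divided by its degrees of freedom, and the observed $F$ is the ratio of the treatment mean square to the error mean square. The sketch below rebuilds the 12.2.7 table from its SS and df columns; the function name `one_way_anova` and the dictionary layout are illustrative assumptions.

```python
def one_way_anova(ss_treatment, df_treatment, ss_error, df_error):
    """Fill in the MS, F, and Total entries of a one-factor ANOVA table."""
    ms_treatment = ss_treatment / df_treatment
    ms_error = ss_error / df_error
    return {
        "ms_treatment": ms_treatment,
        "ms_error": ms_error,
        "F": ms_treatment / ms_error,
        "ss_total": ss_treatment + ss_error,
        "df_total": df_treatment + df_error,
    }


# SS and df values taken from the table for 12.2.7.
table = one_way_anova(ss_treatment=271.36, df_treatment=4,
                      ss_error=106.00, df_error=10)
```

The computed entries match the table: MS values of 67.84 and 10.60, $F = 6.40$, and totals of 377.36 on 14 df.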


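12.2.9.

$$\mathrm{SSTOT} = \sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(Y_{ij} - \bar{Y}_{..}\right)^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(Y_{ij}^2 - 2Y_{ij}\bar{Y}_{..} + \bar{Y}_{..}^2\right)$$

$$= \sum_{j=1}^{k}\sum_{i=1}^{n_j} Y_{ij}^2 - 2\bar{Y}_{..}\sum_{j=1}^{k}\sum_{i=1}^{n_j} Y_{ij} + n\bar{Y}_{..}^2$$

$$= \sum_{j=1}^{k}\sum_{i=1}^{n_j} Y_{ij}^2 - 2n\bar{Y}_{..}^2 + n\bar{Y}_{..}^2$$

$$= \sum_{j=1}^{k}\sum_{i=1}^{n_j} Y_{ij}^2 - n\bar{Y}_{..}^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j} Y_{ij}^2 - C,$$

where $C = T_{..}^2/n$. Also,

$$\mathrm{SSTR} = \sum_{j=1}^{k} n_j\left(\bar{Y}_{.j} - \bar{Y}_{..}\right)^2 = \sum_{j=1}^{k} T_{.j}^2/n_j - 2\bar{Y}_{..}\sum_{j=1}^{k} n_j\bar{Y}_{.j} + n\bar{Y}_{..}^2$$

$$= \sum_{j=1}^{k} T_{.j}^2/n_j - 2n\bar{Y}_{..}^2 + n\bar{Y}_{..}^2 = \sum_{j=1}^{k} T_{.j}^2/n_j - C.$$
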


12.2.11. Analyzed with a two-sample $t$ test, the data
in Question 9.2.8 require that $H_0$: $\mu_X = \mu_Y$ be rejected
(in favor of a two-sided $H_1$) at the $\alpha = 0.05$ level if
$|t| \ge t_{.025,6+9-2} = 2.1604$. Evaluating the test statistic gives
$t = (70.83 - 79.33)/\left(11.31\sqrt{1/6 + 1/9}\right) = -1.43$, which implies
that $H_0$ should not be rejected. The ANOVA table for the
same data shows that $F = 2.04$. But $(-1.43)^2 = 2.04$. Moreover,
$H_0$ would be rejected with the analysis of variance if
$F \ge F_{.95,1,13} = 4.667$. But $(2.1604)^2 = 4.667$.

Source   df   SS     MS    F
Sex       1    260   260   2.04
Error    13   1661   128
Total    14   1921
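
The relationship $t^2 = F$ used in 12.2.11 can be checked directly from the numbers quoted in the answer. This is a sketch only; the variable names are illustrative, and the small discrepancy between the squared unrounded $t$ and 2.04 reflects rounding in the reported values.

```python
import math

# Summary values quoted in the answer to 12.2.11.
mean_x, mean_y = 70.83, 79.33
pooled_sd = 11.31
n_x, n_y = 6, 9

# Pooled two-sample t statistic.
t_obs = (mean_x - mean_y) / (pooled_sd * math.sqrt(1 / n_x + 1 / n_y))

t_squared = t_obs ** 2            # close to the ANOVA F of 2.04
critical_t = 2.1604               # t_{.025,13}
critical_f = critical_t ** 2      # should match F_{.95,1,13} = 4.667
```

Rounded to two decimals `t_obs` is -1.43, and `critical_f` rounds to 4.667, so the two-sample $t$ test and the one-factor analysis of variance lead to the same decision.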
12.2.13. 
Source df SS MS F P 
Law 1 16.333 16.333 1.58 0.2150 
Error 46 475.283 10.332 
Total 47 491.616 


The F critical value is 4.05. 

For the pooled two-sample $t$ test, the observed $t$ ratio is
$-1.257$, and the critical value is 2.0129.

Note that $(-1.257)^2 = 1.58$ (rounded to two decimal places),
which is the observed $F$ ratio. Also, $2.0129^2 = 4.05$ (rounded
to two decimal places), which is the $F$ critical value.


Section 12.3 
12.3.1.

Pairwise Difference   Tukey Interval      Conclusion
$\mu_1 - \mu_2$       (-15.27, 13.60)     NS
$\mu_1 - \mu_3$       (-23.77, 5.10)      NS
$\mu_1 - \mu_4$       (-33.77, -4.90)     Reject
$\mu_2 - \mu_3$       (-22.94, 5.94)      NS
$\mu_2 - \mu_4$       (-32.94, -4.06)     Reject
$\mu_3 - \mu_4$       (-24.44, 4.44)      NS


12.3.3. Obs. $F = 5.81$ with 2 and 15 df; reject $H_0$: $\mu_C = \mu_A = \mu_M$
at $\alpha = 0.05$ but not at $\alpha = 0.01$.

Pairwise Difference   Tukey Interval      Conclusion
$\mu_C - \mu_A$       (-78.9, 217.5)      NS
$\mu_C - \mu_M$       (-271.0, 25.4)      NS
$\mu_A - \mu_M$       (-340.4, -44.0)     Reject
12.3.5.

Pairwise Difference   Tukey Interval      Conclusion
$\mu_1 - \mu_2$       (-29.5, 2.8)        NS
$\mu_1 - \mu_3$       (-56.2, -23.8)      Reject
$\mu_2 - \mu_3$       (-42.8, -10.5)      Reject

12.3.7. Longer. As $k$ gets larger, the number of possible
pairwise comparisons increases. To maintain the same overall
probability of committing at least one Type I error, the
individual intervals would need to be widened.


Section 12.4 

12.4.1. 

Source   df   SS       MS      F
Tube      2    510.7   255.4   11.56
Error    42    927.7    22.1
Total    44   1438.4

Subhypothesis                         Contrast                                                 SS      F
$H_0$: $\mu_A = \mu_C$                $C_1 = \mu_A - \mu_C$                                    264     11.95
$H_0$: $\mu_B = (\mu_A + \mu_C)/2$    $C_2 = \frac{1}{2}\mu_A - \mu_B + \frac{1}{2}\mu_C$      246.7   11.16

$H_0$: $\mu_A = \mu_B = \mu_C$ is strongly rejected ($F_{.99,2,42} \approx F_{.99,2,40} = 5.18$).
Theorem 12.4.1 holds true for orthogonal contrasts $C_1$ and
$C_2$: $SS_{C_1} + SS_{C_2} = 264 + 246.7 = 510.7 = \mathrm{SSTR}$.

12.4.3. $\hat{C} = -14.25$; $SS_C = 812.25$; obs. $F = 10.19$; $F_{.95,1,20} = 4.35$;
reject $H_0$.

12.4.5.


Q 

nN 
BIZ OR 
| 

= 

ooo 


i 
12 6 


C,; and C; are orthogonal because “VS? + COC) — 9; 
also C, and C; are orthogonal because *—? + UC = 
0. $\hat{C}_3 = -2.293$ and $SS_{C_3} = 8.97$. But $SS_{C_1} + SS_{C_2} + SS_{C_3} =
4.68 + 1.12 + 8.97 = 14.77 = \mathrm{SSTR}$.


Section 12.5 


12.5.1. Replace each observation by its square root. At the
$\alpha = 0.05$ level, the null hypothesis that the two developers have
equal means is rejected. (For $\alpha = 0.01$, though,
we would fail to reject $H_0$.)

Source      df   SS      MS      F      P
Developer    1   1.836   1.836   6.23   0.032
Error       10   2.947   0.295
Total       11   4.783


12.5.3. Since $Y_{ij}$ is a binomial random variable based on
$n = 20$ trials, each data point should be replaced by the arcsin
of $(y_{ij}/20)^{1/2}$. Based on those transformed observations,
$H_0$: $\mu_A = \mu_B = \mu_C$ is strongly rejected ($P < 0.001$).




Source df SS MS F P 
Launcher 2 0.30592 0.15296 22.34 0.000 
Error 9 0.06163 0.00685 

Total 11 0.36755 


Appendix 12.A.3 


12.A.3.1. The $F$ test will have greater power against $H_1^{**}$
because the latter yields a larger noncentrality parameter
than does $H_1^{*}$.


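12.A.3.3. $M_V(t) = (1 - 2t)^{-r/2} e^{\gamma t(1 - 2t)^{-1}}$, so

$$M_V'(t) = (1 - 2t)^{-r/2}\, e^{\gamma t(1 - 2t)^{-1}} \left[\gamma t(-1)(1 - 2t)^{-2}(-2) + (1 - 2t)^{-1}\gamma\right] + e^{\gamma t(1 - 2t)^{-1}} \left(-\frac{r}{2}\right)(1 - 2t)^{-(r/2) - 1}(-2).$$

Therefore $E(V) = M_V'(0) = \gamma + r$.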
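
The result $E(V) = \gamma + r$ in 12.A.3.3 can be sanity-checked by differentiating the moment-generating function numerically at $t = 0$. The sketch below uses a central difference; the function names, the step size, and the particular values $r = 5$ and $\gamma = 3$ are illustrative assumptions.

```python
import math


def noncentral_chi_square_mgf(t, r, gamma):
    """M_V(t) = (1 - 2t)^(-r/2) * exp(gamma * t / (1 - 2t)), valid for t < 1/2."""
    return (1 - 2 * t) ** (-r / 2) * math.exp(gamma * t / (1 - 2 * t))


def derivative_at_zero(f, h=1e-6):
    """Central-difference approximation to f'(0)."""
    return (f(h) - f(-h)) / (2 * h)


# Illustrative parameter values, not taken from the text.
r, gamma = 5, 3.0
mean_estimate = derivative_at_zero(
    lambda t: noncentral_chi_square_mgf(t, r, gamma)
)
```

For these values `mean_estimate` is numerically indistinguishable from $\gamma + r = 8$.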
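12.A.3.5. $M_V(t) = \prod_{i=1}^{n} (1 - 2t)^{-r_i/2}\, e^{\gamma_i t/(1 - 2t)} = (1 - 2t)^{-\sum_{i=1}^{n} r_i/2}\, e^{\left(\sum_{i=1}^{n} \gamma_i\right) t/(1 - 2t)}$,
which implies that $V$ has a noncentral $\chi^2$ distribution
with $\sum_{i=1}^{n} r_i$ df and with noncentrality parameter $\sum_{i=1}^{n} \gamma_i$.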


CHAPTER 13 

Section 13.2 

13.2.1. 
Source df SS MS F P 
States 1 61.63 61.63 7.20 0.0178 
Students 14 400.80 28.63 3.34 0.0155 
Error 14 119.87 8.56 
Total 29 582.30 


The critical value $F_{.95,1,14}$ is approximately 4.6. Since the
$F$ statistic $= 7.20 > 4.6$, reject $H_0$.


13.2.3. 
Source df SS MS F P 
Additive 1 0.03 0.03 4.19 0.0865 
Batch 6 0.02 0.00 0.41 0.8483 
Error 6 0.05 0.01 
Total 13 0.10 


Since the $F$ statistic $= 4.19 < F_{.95,1,6} = 5.99$, accept $H_0$.


13.2.5. From Table 13.2.9, we obtain MSE = 6.00. The radius
of the Tukey interval is $D\sqrt{\mathrm{MSE}} =
(Q_{.05,3,22}/\sqrt{b})\sqrt{6.00} = (3.56/\sqrt{12})\sqrt{6.00} = 2.517$. The Tukey
intervals are

Pairwise Difference   $\bar{y}_{.s} - \bar{y}_{.t}$   Tukey Interval    Conclusion
$\mu_1 - \mu_2$       -2.41                           (-4.93, 0.11)     NS
$\mu_1 - \mu_3$       -0.54                           (-3.06, 1.98)     NS
$\mu_2 - \mu_3$        1.87                           (-0.65, 4.39)     NS


From this analysis and that of Case Study 13.2.3, we 
find that the significant difference occurs not for overall 
means testing or pairwise comparisons, but for the com- 
parison of “during the full moon” with “not during the full 
moon.” 
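
The Tukey radius and the three intervals in 13.2.5 follow from a short calculation, sketched here in Python. The variable names are illustrative assumptions; the studentized range value 3.56, the block count $b = 12$, the MSE, and the observed differences are the ones quoted in the answer.

```python
import math

q = 3.56      # Q_{.05,3,22}, as quoted in the answer
b = 12        # number of blocks
mse = 6.00

# Radius of each Tukey interval: (Q / sqrt(b)) * sqrt(MSE).
radius = (q / math.sqrt(b)) * math.sqrt(mse)

# Observed differences in treatment means, as quoted in the answer.
differences = {"mu1 - mu2": -2.41, "mu1 - mu3": -0.54, "mu2 - mu3": 1.87}

intervals = {
    name: (round(d - radius, 2), round(d + radius, 2))
    for name, d in differences.items()
}

# A difference is significant only if its interval excludes zero.
conclusions = {
    name: "NS" if low < 0 < high else "Reject"
    for name, (low, high) in intervals.items()
}
```

All three intervals contain zero, so every pairwise comparison is reported as NS, as in the table.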


13.2.7. 


Pairwise Difference   $\bar{y}_{.s} - \bar{y}_{.t}$   Tukey Interval    Conclusion
$\mu_1 - \mu_2$        2.925                          (0.78, 5.07)      Reject
$\mu_1 - \mu_3$        1.475                          (-0.67, 3.62)     NS
$\mu_2 - \mu_3$       -1.450                          (-3.60, 0.70)     NS

13.2.9. (a) 

Source         df   SS       MS      F       P
Sleep stages    2    16.99    8.49    4.13   0.0493
Shrew           5   195.44   39.09   19.00   0.0001
Error          10    20.57    2.06
Total          17   233.00

(b) Since the observed $F$ ratio $= 2.42 < F_{.95,1,10} = 4.96$, accept
the subhypothesis. For the contrast $C_1 = -\frac{1}{2}\mu_1 - \frac{1}{2}\mu_2 + \mu_3$,
$SS_{C_1} = 4.99$. For the contrast $C_2 = \mu_1 - \mu_2$, $SS_{C_2} = 12.00$.
Then $\mathrm{SSTR} = 16.99 = 4.99 + 12.00 = SS_{C_1} + SS_{C_2}$.


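13.2.11. In each case let $c = T_{..}^2/(bk)$.

Equation 13.2.2:

$$\mathrm{SSTR} = \sum_{i=1}^{b}\sum_{j=1}^{k}\left(\bar{Y}_{.j} - \bar{Y}_{..}\right)^2 = b\sum_{j=1}^{k}\left(\bar{Y}_{.j} - \bar{Y}_{..}\right)^2$$

$$= b\sum_{j=1}^{k}\bar{Y}_{.j}^2 - 2b\bar{Y}_{..}\sum_{j=1}^{k}\bar{Y}_{.j} + bk\bar{Y}_{..}^2$$

$$= \sum_{j=1}^{k}\frac{T_{.j}^2}{b} - \frac{2T_{..}^2}{bk} + \frac{T_{..}^2}{bk} = \sum_{j=1}^{k}\frac{T_{.j}^2}{b} - c$$

Equation 13.2.3:

$$\mathrm{SSB} = \sum_{i=1}^{b}\sum_{j=1}^{k}\left(\bar{Y}_{i.} - \bar{Y}_{..}\right)^2 = k\sum_{i=1}^{b}\left(\bar{Y}_{i.} - \bar{Y}_{..}\right)^2$$

$$= k\sum_{i=1}^{b}\left(\bar{Y}_{i.}^2 - 2\bar{Y}_{i.}\bar{Y}_{..} + \bar{Y}_{..}^2\right) = k\sum_{i=1}^{b}\bar{Y}_{i.}^2 - 2k\bar{Y}_{..}\sum_{i=1}^{b}\bar{Y}_{i.} + bk\bar{Y}_{..}^2$$

$$= \sum_{i=1}^{b}\frac{T_{i.}^2}{k} - \frac{2T_{..}^2}{bk} + \frac{T_{..}^2}{bk} = \sum_{i=1}^{b}\frac{T_{i.}^2}{k} - c$$

Equation 13.2.4:

$$\mathrm{SSTOT} = \sum_{i=1}^{b}\sum_{j=1}^{k}\left(Y_{ij} - \bar{Y}_{..}\right)^2 = \sum_{i=1}^{b}\sum_{j=1}^{k}\left(Y_{ij}^2 - 2Y_{ij}\bar{Y}_{..} + \bar{Y}_{..}^2\right)$$

$$= \sum_{i=1}^{b}\sum_{j=1}^{k} Y_{ij}^2 - 2\bar{Y}_{..}\sum_{i=1}^{b}\sum_{j=1}^{k} Y_{ij} + bk\bar{Y}_{..}^2$$

$$= \sum_{i=1}^{b}\sum_{j=1}^{k} Y_{ij}^2 - \frac{2T_{..}^2}{bk} + \frac{T_{..}^2}{bk} = \sum_{i=1}^{b}\sum_{j=1}^{k} Y_{ij}^2 - c$$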


13.2.13. (a) False. They are equal only when $b = k$.
(b) False. If neither treatment levels nor blocks are
significant, it is possible to have the $F$ variables

$$\frac{\mathrm{SSTR}/(k - 1)}{\mathrm{SSE}/[(b - 1)(k - 1)]} \quad\text{and}\quad \frac{\mathrm{SSB}/(b - 1)}{\mathrm{SSE}/[(b - 1)(k - 1)]}$$

both $< 1$. In that case both SSTR and SSB are less than SSE.


Section 13.3 

13.3.1. Since $1.51 < 1.7341 = t_{.05,18}$, do not reject $H_0$.

13.3.3. $\alpha = 0.05$: Since $-t_{.025,11} = -2.2010 < 0.74 < 2.2010 = t_{.025,11}$,
accept $H_0$.

$\alpha = 0.01$: Since $-t_{.005,11} = -3.1058 < 0.74 < 3.1058 = t_{.005,11}$,
accept $H_0$.

13.3.5. Since $-t_{.025,6} = -2.4469 < -2.0481 < 2.4469 = t_{.025,6}$,
accept $H_0$. The square of the observed Student $t$ statistic $=
(-2.0481)^2 = 4.1947 =$ the observed $F$ statistic. Also,
$(t_{.025,6})^2 = (2.4469)^2 = 5.987 = F_{.95,1,6}$. Conclusion: the square of
the $t$ statistic for paired data is the randomized block design
statistic for 2 treatments.


13.3.7. (—0.21, 0.43) 


CHAPTER 14 


Section 14.2 


14.2.1. Here, $x = 8$ of the $n = 10$ groups were larger than the
hypothesized median of 9. The P-value is $P(X \ge 8) + P(X \le 2)
= 0.000977 + 0.009766 + 0.043945 + 0.043945 + 0.009766 +
0.000977 = 2(0.054688) = 0.109376$.

14.2.3. The median of $f_Y(y)$ is 0.693. There are $x = 22$ values
that exceed the hypothesized median of 0.693. The test statistic
is $z = \dfrac{22 - 50/2}{\sqrt{50/4}} = -0.85$. Since $-z_{0.025} = -1.96 < -0.85 <
z_{0.025} = 1.96$, do not reject $H_0$.
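
Both sign test calculations above reduce to binomial arithmetic with $p = 1/2$: 14.2.1 uses the exact distribution and 14.2.3 uses the normal approximation. The following Python sketch reproduces them; the function names are illustrative assumptions, and the exact P-value differs from the printed 0.109376 only in the last digit because the printed terms were rounded before being summed.

```python
import math


def binomial_pmf(k, n, p=0.5):
    """P(X = k) for a binomial random variable with n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def sign_test_exact_p(x, n):
    """Two-sided exact P-value: twice the tail at least as extreme as x."""
    upper = max(x, n - x)
    tail = sum(binomial_pmf(k, n) for k in range(upper, n + 1))
    return min(1.0, 2 * tail)


def sign_test_z(x, n):
    """Large-sample statistic (x - n/2) / sqrt(n/4)."""
    return (x - n / 2) / math.sqrt(n / 4)


p_value = sign_test_exact_p(x=8, n=10)   # 14.2.1
z_obs = sign_test_z(x=22, n=50)          # 14.2.3
```

Here `p_value` is 0.109375 and `z_obs` rounds to -0.85, so neither answer leads to rejecting $H_0$ at the 0.05 level.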

14.2.5. 
$y_+$   $P(Y_+ = y_+)$
0       1/128
1       7/128
2       21/128
3       35/128
4       35/128
5       21/128
6       7/128
7       1/128

Possible levels for a one-sided test: 1/128, 8/128, 29/128, etc. 
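
The table in 14.2.5 is the binomial distribution with $n = 7$ and $p = 1/2$, and the attainable one-sided levels are its cumulative sums. A minimal Python sketch, with illustrative variable names, that works with the numerators over $2^7 = 128$:

```python
from itertools import accumulate
from math import comb

n = 7

# Numerators of P(Y+ = k) over the common denominator 2**7 = 128.
counts = [comb(n, k) for k in range(n + 1)]

# Cumulative numerators: the attainable one-sided significance levels.
levels = list(accumulate(counts))
```

The first three cumulative numerators are 1, 8, and 29, which correspond to the levels 1/128, 8/128, and 29/128 listed above.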
14.2.7. $P(Y_+ \le 6) = 0.0835$; $P(Y_+ \le 7) = 0.1796$. The closest
test to one with $\alpha = 0.10$ is to reject $H_0$ if $y_+ \le 6$. Since $y_+ = 9$,
accept $H_0$. Since the observed $t$ statistic $= -1.71 < -1.330 =
-t_{.10,18}$, reject $H_0$.

14.2.9. The approximate, large-sample observed $Z$ ratio is
1.89. Accept $H_0$, since $-z_{.025} = -1.96 < 1.89 < 1.96 = z_{.025}$.

14.2.11. From Table 13.3.1, the number of pairs where $x_i > y_i$
is 7. The P-value for this test is $P(U \ge 7) + P(U \le 3) =
2(0.171875) = 0.34375$. Since the P-value exceeds $\alpha = 0.05$,
do not reject the null hypothesis, which is the conclusion of
Case Study 13.3.1.


Section 14.3 


14.3.1. For the critical values of 7 and 29, $\alpha = 0.148$. Since
$w = 9$, accept $H_0$.

14.3.3. The observed $Z$ statistic has value 0.99. Since
$-z_{.025} = -1.96 < 0.99 < 1.96 = z_{.025}$, accept $H_0$.


14.3.5. Since $w' = \dfrac{w - E(W)}{\sqrt{617.5}} = -1.37 < -1.28 = -z_{.10}$, reject
$H_0$. The sign test accepted $H_0$.


14.3.7. The signed rank test should have more power since it 
uses more of the information in the data. 

14.3.9. A reasonable assumption is that alcohol abuse short- 
ens life span. In that case, reject Ho if the test statistic is less 
than $-z_{.05} = -1.64$. Since the test statistic has value $-1.88$,
reject Ho. 


Section 14.4 


14.4.1. Assume the data within groups are independent and
that the group distributions have the same shape. Let the
null hypothesis be that teachers' expectations do not matter.
The Kruskal-Wallis statistic has value $b = 5.64$. Since
$5.64 < 5.991 = \chi^2_{.95,2}$, accept $H_0$.

14.4.3. Since $b = 1.68 < 3.841 = \chi^2_{.95,1}$, do not reject $H_0$.

14.4.5. Since $b = 10.72 > 7.815 = \chi^2_{.95,3}$, reject $H_0$.

14.4.7. Since $b = 12.48 > 5.991 = \chi^2_{.95,2}$, reject $H_0$.


Section 14.5 
14.5.1. Since $g = 8.8 < 9.488 = \chi^2_{.95,4}$, accept $H_0$.

14.5.3. Since $g = 17.0 > 5.991 = \chi^2_{.95,2}$, reject $H_0$.

14.5.5. Since $g = 8.4 < 9.210 = \chi^2_{.99,2}$, accept $H_0$. On the
other hand, using the analysis of variance, the null hypothesis
would be rejected at this level.


Section 14.6 


14.6.1. (a) For these data, $w = 23$ and $z = -0.53$. Since
$-z_{.025} = -1.96 < -0.53 < 1.96 = z_{.025}$, accept $H_0$ and assume
the sequence is random.

(b) For these data, $w = 21$ and $z = -1.33$. Since $-z_{.025} =
-1.96 < -1.33 < 1.96 = z_{.025}$, accept $H_0$ and assume the
sequence is random.

14.6.3. For these data, $w = 19$ and $z = 1.68$. Since $-z_{.025} =
-1.96 < 1.68 < 1.96 = z_{.025}$, accept $H_0$ and assume the sequence
is random.

14.6.5. For these data, $w = 25$ and $z = -0.51$. Since $-z_{.025} =
-1.96 < -0.51 < 1.96 = z_{.025}$, accept $H_0$ at the 0.05 level of
significance and assume the sequence is random.

Beta distribution, 336
Binomial distribution:
definition, 104-105
estimate for p, 282-283, 312-313, 321
moment-generating function, 208
Curve-fitting:
Decision rule (see Hypothesis testing; Testing)
DeMoivre-Laplace limit theorem, 239-240, 246
Efficiency, 317-319, 322
Efficient estimator, 332 
Estimation (see also Confidence interval; Estimator): 
Bayesian, 333-344 
least squares, 533-534 
maximum likelihood, 282-291 
method of moments, 293-296 
point versus interval, 297-298 
Estimator (see also Confidence interval; Estimation): 
best, 321-322 
for binomial p, 282-283, 312-313, 321 
for bivariate normal parameters, 586-587 
consistent, 330-333 
for contrast, 612-613 
for correlation coefficient, 577-578 
Cramér-Rao lower bound, 320 
difference between estimate and estimator, 283, 286 
efficient, 322 
for exponential parameter, 288 
for gamma parameters, 295-296 
for geometric parameter, 288-290 
interval, 297-298 
for normal parameters, 285-286, 315-316 
for Poisson parameter, 285-286, 326-327, 344 
for slope and y-intercept (linear model), 557-560 
sufficient, 323, 326-327 
unbiasedness, 313-316 
for uniform parameter, 331, 347-349 
for variance in linear model, 557, 561 
Event, 18 
Expected value (see also “moments” listings for specific 
distributions): 
conditional, 555-557 
definition, 140, 160-161 
examples, 139-146, 183-185, 598-599 
of functions, 150-154, 183, 185, 187-188, 192 
of linear combinations, 192 
of loss functions, 342 
in method of moments estimation, 293-294 
relationship to median, 147 
relationship to moment-generating function, 210 
of sums, 185 
Experiment, 18 
Experimental design, 430, 435, 448-450, 595-596, 629-630, 
635-636, 647-653 
Exponential distribution: 
examples, 134-135, 145, 147-148, 180-182, 194-195, 236-237, 
275, 408 
moment-generating function, 208-209 
moments, 145-146, 211 
parameter estimation, 287-288 
relationship to Poisson distribution, 235-236 
threshold parameter, 288 


Exponential form, 330 
Exponential regression, 3, 544-547 


Factor, 431-432 

Factorial moment-generating function, 262 

Factorization theorem, 327-328 

Factor levels, 431-432 

F distribution: 
in analysis of variance, 601, 614, 633 
definition, 390 
in inferences about variance ratios, 471-472, 483 
relationship to chi square distribution, 390 
relationship to Student t distribution, 391-392
table, 391, 703-717 

Finite correction factor, 309 

Fisher’s lemma, 425 

Friedman’s test, 682-683, 694-695 


Gamma distribution: 
additive property, 273 
definition, 270, 272 
examples, 271, 294-296 
moment-generating function, 273 
moments, 272, 274, 294 
parameter estimation, 294-296 
relationship to chi square distribution, 389 
relationship to exponential distribution, 270, 337-338 
relationship to normal distribution, 389 
relationship to Poisson distribution, 270 
Generalized likelihood ratio, 379-380 
Generalized likelihood ratio test (GLRT): 
definition, 381 
examples, 379-382, 401, 425-429, 476-477, 488-491, 500, 597 
Geometric distribution: 
definition, 260-261 
examples, 261-262, 288-290 
moment-generating function, 207-208, 261 
moments, 211, 261 
parameter estimation, 288-290 
relationship to negative binomial distribution, 262-263 
Geometric probability, 166-168 
Goodness-of-fit test (see Chi square test) 


Hazard rate, 139 
Hypergeometric distribution: 
definition, 110-112 
examples, 112-116, 142
moments, 142-143, 191-192, 309 
relationship to binomial distribution, 110, 202-203 
Hypothesis testing (see also Testing): 
critical region, 355 
decision rule, 351-354, 374-377, 381 
level of significance, 355 
P-value, 358-359, 362-363 
Type I and Type II errors, 366-369, 608 


Independence: 
effect of, on the expected value of a product, 188 
of events, 34, 53, 58-59 
mutual versus pairwise, 58-59 
of random variables, 173-175, 187-188 
of regression estimators, 560, 592-594 
of repeated trials, 61 
of sample mean and sample variance (normal data), 390, 
423-425, 560 
of sums of squares, 600, 632 
tests for, 494, 519-527 


Independent samples, 433, 437-439, 457-458, 596, 649-653, 
673-674, 677-678 

Intersection, 21 

Interval estimate (see Confidence interval; Prediction interval) 


Joint cumulative distribution function, 171-172 
Joint probability density function, 162-165, 172 


Kruskal-Wallis test, 677-681, 689-694 
k-sample data, 439-440, 595-596, 677-678 
Kurtosis, 161 


Law of small numbers, 230-231 
Level of significance, 355, 359, 366-367, 375-377, 608-609 
Likelihood function, 284 
Likelihood ratio (see Generalized likelihood ratio) 
Likelihood ratio test (see Generalized likelihood ratio test (GLRT)) 
Linear model (see also Curve-fitting): 
assumptions, 443-444, 555-557 
confidence intervals for parameters, 564-565, 567 
hypothesis tests, 562, 568-569, 572 
parameter estimation, 557, 561 
Logarithmic regression, 547-549 
Logistic regression, 549-552 
Loss function, 341-343 


Margin of error, 305-307 
Marginal probability density function, 164, 169-170, 339-340, 496-497 
Maximum likelihood estimation (see also Estimation): 
definition, 285 
examples, 282-283, 285-291, 557-558 
in goodness-of-fit testing, 509 
properties, 329, 333 
in regression analysis, 557-558, 561 
Mean (see Expected value) 
Mean free path, 145 
Mean square, 602 
Median, 147, 304, 317, 333, 657 
Median unbiased, 317 
Method of least squares (see Estimation) 
Method of moments (see Estimation) 
Minimum variance estimator, 321 
MINITAB calculations: 
for cdf, 219-220, 278-279 
for completely randomized one-factor design, 621-623 
for confidence intervals, 299-300, 422, 491 
for choosing samples, 487-488 
for critical values, 422 
for Friedman’s test, 694-695 
for independence, 531, 590-591 
for Kruskal-Wallis test, 694 
for Monte Carlo analysis, 274-278, 299-300, 347-349, 354, 407-408 
for one-sample t test, 423
for pdf, 219, 365 
for randomized block design, 653-654 
for regression analysis, 590-592 
for robustness, 407-409 
for sample statistics, 421 
for Tukey confidence intervals, 622-623 
for two-sample t test, 491-492
Model equation, 436-437, 439, 442-443, 597, 631 
Moment-generating function (see also “moment-generating function”
listings for specific distributions): 
definition, 207 
examples, 207-209 
in proof of central limit theorem, 280 
properties, 210, 214 
relationship to moments, 210, 212 
as technique for finding distributions of sums, 214-215 




Moments (see Expected value; Variance; “moments” listings 
for specific distributions) 

Monte Carlo studies, 100-101, 274-278, 299-301, 347-349 

Moore’s Law, 545-547 

Multinomial coefficients, 81 

Multinomial distribution, 494-496, 521 

Multiple comparisons, 608-611 

Multiplication rule, 68 

Mutually exclusive events, 22, 27, 55 


Negative binomial distribution, 262-268 
definition, 262-263 
examples, 126, 264-266, 340 
moment-generating function, 263-264 
moments, 263-264 
Noncentral chi square distribution, 625 
Noncentral F distribution, 626-628 
Noncentral t distribution, 419 
Noninformative prior, 336 
Nonparametric statistics, 656 
Normal distribution (see also Standard normal distribution): 
additive property, 257-258 
approximation to binomial distribution, 239-240, 242-244, 279 
approximation to sign test, 657 
approximation to Wilcoxon signed rank statistic, 669 
central limit theorem, 239-240, 246-249, 280 
confidence interval for mean, 298-302, 396-398 
confidence interval for variance, 412 
definition, 251 
hypothesis test for mean (variance known), 357 
hypothesis test for mean (variance unknown), 401, 406-409 
hypothesis test for variance, 415, 427-429 
independence of sample mean and sample variance, 390, 423-425 
as limit for Student t distribution, 386-388, 393
in linear model, 556-557 
moment-generating function, 209, 215 
moments, 251 
parameter estimation, 290-291, 315-316 
relationship to chi square distribution, 389, 391, 417 
relationship to gamma distribution, 389 
table, 240-242, 697-698 
transformation to standard normal, 215-216, 
252-257, 259 
unbiased estimator for variance, 315-316, 561 
Null hypothesis, 350, 358 


One-sample data, 435-437, 657 
One-sample t test, 401, 423, 425-426 
Operating characteristic curve, 116 
Order statistics: 

definition, 193 

estimates based on, 288, 314, 319, 331 

joint pdf, 198 

probability density function for ith, 194, 196 
Outliers, 529-531 


Paired data, 440-442, 642-643, 660-661, 672-673 
Paired t test, 440, 642-644, 649-653 
Pairwise comparisons (see Tukey’s test) 
Parameter, 281-282 
Parameter space, 380-381, 425, 427 
Pareto distribution, 292, 297, 330, 504-505 
Partitioned sample space, 43, 48 
Pascal’s triangle, 88 
Pearson product moment correlation coefficient, 578 
Permutations: 

objects all distinct, 74 

objects not all distinct, 80 
Poisson distribution:
additive property, 214-215 
definition, 227 
examples, 121, 224-226, 228-230, 233, 408 
hypothesis test, 375-377 
as limit of binomial distribution, 222-223, 232 
moment-generating function, 213 
moments, 213, 227 
parameter estimation, 285-286, 326-327, 337-338, 344 
relationship to exponential distribution, 235-236 
relationship to gamma distribution, 270, 337-338 
square root transformation, 618 
Poisson model, 230-231 
Poker hands, 96-97 
Political arithmetic, 11-13 
Posterior distribution, 335-339 
Power, 369-373, 628 
Power curve, 369-370, 382-383 
Prediction interval, 571, 592 
Prior distribution, 334-339 
Probability: 
axiomatic definition, 18, 27-28 
classical definition, 9, 17 
empirical definition, 17-18 
Probability density function (pdf), 124, 135-136, 172, 178, 181-182 
Probability function, 27-28, 119, 129-131 
Producer’s risk, 377 
P-value, 358-359, 362-363 


Qualitative measurement, 434 
Quantitative measurement, 434 


Random deviates, 266-269, 279 
Random Mendelian mating, 56-57 
Randomized block data, 442-443, 629-630 
Randomized block design: 
block sum of squares, 632 
comparison with completely randomized one-factor design, 635-636 
computing formulas, 634 
error sum of squares, 631-632 
notation, 631 
relationship to paired t test, 648
test statistic, 633 
treatment sum of squares, 632 
Random sample, 175 
Random variable, 102-103, 119, 124, 135-136 
Range, 199 
Rank sum test (see Wilcoxon rank sum test) 
Rayleigh distribution, 146 
Rectangular distribution (see Uniform distribution) 
Regression curve, 555-557, 586 
Regression data, 443-446, 532, 555-557, 575-576 
Relative efficiency, 317-319 
Repeated independent trials, 61, 495 
Resampling, 345 
Residual, 535 
Residual plot, 535-540 
Risk, 342-344 
Robustness, 399, 406-409, 420-421, 462-463, 517, 656, 689-693 
Runs, 684-687 


Sample correlation coefficient: 
definition, 577-578 
interpretation and misinterpretation, 578-579, 589-590 
in tests of independence, 587-589 
Sample outcome, 18 
Sample size determination, 307-308, 373-374, 414, 455-456 
Sample space, 18 


Sample standard deviation, 316 
Sample variance, 316, 394, 459, 561, 572, 599-600 
Sampling distributions, 388-389 
Serial number analysis, 6 
Sign test, 657-661, 693 
Signed rank test (see Wilcoxon signed rank test) 
Simple linear model (see Linear model) 
Skewness, 161 
Spurious correlation, 589-590 
Square root transformation, 617-618 
Squared-error consistent, 333 
Standard deviation, 156, 316 
Standard normal distribution (see also Normal distribution): 
in central limit theorem, 246-247, 251 
definition, 240 
in DeMoivre-Laplace limit theorem, 239-240 
table, 240-242, 697-698 
Z transformation, 215-216, 252, 257 
Statistic, 283 
Statistically significant, 355, 382-384 
Stirling’s formula, 76-77, 82 
St. Petersburg paradox, 144-145 
Studentized range, 608-609, 718-719 
Student t distribution:
approximated by standard normal distribution, 386-388, 393 
definition, 391-393 
in inferences about difference between two dependent 
means, 644 
in inferences about difference between two independent means, 
458-460, 460, 468 
in inferences about single mean, 396, 401 
in regression analysis, 561-562, 564-565, 567, 569-572, 587 
relationship to chi square distribution, 391 
relationship to F distribution, 391-392 
table, 395-396, 699-701 
Subhypothesis, 597, 608-609, 612-614 
Sufficient estimator: 
definition, 323, 326-328 
examples, 323-329 
exponential form, 330 
factorization criterion, 327-328 
relationship to maximum likelihood estimator, 329 
relationship to minimum variance, unbiased estimator, 329 


t distribution (see Student t distribution)
Testing (see also Hypothesis testing) 

that correlation coefficient is zero, 587-589 

the equality of k location parameters (dependent samples), 
682-683 

the equality of k location parameters (independent samples), 
677-678 

the equality of k means (dependent samples), 632-633 

the equality of k means (independent samples), 599-601 

the equality of two location parameters (dependent samples), 
660-661 

the equality of two location parameters (independent samples), 
673-674 

the equality of two means (dependent samples), 644 

the equality of two means (independent samples), 460, 468, 
606-607 

the equality of two proportions (independent samples), 476-478 

the equality of two slopes (independent samples), 572 

the equality of two variances (independent samples), 471-472 

for goodness-of-fit, 494, 499-500, 506-508, 510, 642-644 

for independence, 494, 519-527, 562, 587 

the parameter of Poisson distribution, 375-377 

the parameter of uniform distribution, 379-382 

for randomness, 685 


a single mean with variance known, 357 
a single mean with variance unknown, 401, 425-426 
a single median, 657 
a single proportion, 361, 364-365 
a single variance, 415, 427-429, 567-568 
the slope of a regression line, 562, 591 
subhypotheses, 608-609, 614 
Test statistic, 355 
Threshold parameter, 288 
Total sum of squares, 600-601, 604 
Transformations: 
of data, 617-618 
of random variables, 176-182 
Treatment sum of squares, 598-601, 604, 614, 624, 632 
Trinomial distribution, 498-499 
Tukey’s test, 608-610, 622-623, 637-638 
Two-sample data, 437-439, 457-458, 673-674 
Two-sample t test, 437, 458-460, 488-491, 572, 606-607, 649-653
Type I error, 366-367, 375-377, 608
Type II error, 366-369, 419-420
Unbiased estimator, 313-316

Uniform distribution, 131, 166-168, 199, 249-250, 268, 331, 
374-375, 379-382, 407 

Union, 21 


Variance (see also Sample variance; Testing) 
computing formula, 157 
confidence interval, 412, 567 
definition, 156 
in hypothesis tests, 415, 471-472, 567-568 
lower bound (Cramér-Rao), 320-322 
properties, 158 
of a sum, 189-190, 612 

Venn diagrams, 25-26, 29, 35 


Weak law of large numbers, 333 

Weibull distribution, 292 

Wilcoxon rank sum test, 673-676 

Wilcoxon signed rank test, 662-672, 693-694, 720-721 


Z transformation (see Normal distribution) 